Implementing a new data science and analytics platform Part 2

Recap

In Part 1, I introduced our journey towards the implementation of a Data Science and Analytics platform, and explained that a data-driven company needs to consider many aspects, from hiring good talent to investing in a new data platform. We then went through a non-exhaustive list of requirements for a good data platform, which we used to shortlist two solutions for a POC: Databricks and AWS SageMaker.

Databricks is a software platform that helps its customers unify analytics across the business, data science, and data engineering: a Unified Analytics Platform where data science teams collaborate with data engineering and lines of business to build data products.
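
To give a flavour of what this unified workflow looks like in practice, below is a minimal PySpark sketch of the kind of notebook cell Databricks is built around. The table and column names are hypothetical, not our actual data.

```python
# Minimal sketch of a typical Databricks notebook cell (hypothetical table/columns).
# In a Databricks notebook, `spark` is provided for you; building the session
# explicitly, as below, is only needed when running outside the platform.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read a table registered in the workspace metastore and aggregate it.
quotes = spark.table("insurance.quotes")  # hypothetical table
summary = (
    quotes.groupBy("product")
    .agg(F.count("*").alias("n_quotes"), F.avg("premium").alias("avg_premium"))
)
summary.show()
```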

Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then deploy them directly into a production-ready hosted environment. It was already available to us internally, as CTM uses AWS as its cloud provider.
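
To illustrate the build-train-deploy flow described above, here is a minimal sketch using the SageMaker Python SDK. The training script, IAM role, and S3 path are placeholders, not our actual setup.

```python
# Minimal train-and-deploy sketch with the SageMaker Python SDK.
# The entry-point script, IAM role, and S3 path are hypothetical placeholders.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

estimator = SKLearn(
    entry_point="train.py",  # your training script (hypothetical)
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version="1.2-1",
    sagemaker_session=session,
)

# Train on data held in S3, then deploy the model to a managed HTTPS endpoint.
estimator.fit({"train": "s3://my-bucket/training-data/"})  # placeholder path
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

The appeal here is that the hosted endpoint, scaling, and instance management are handled by the service rather than by us.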


The POC ran for a month, during which we assessed the functionality of both solutions and validated them against each other, as well as against our current environment where relevant.

Methodology

The POC was divided into two main parts:

Architecture & DevOps assessment

End-to-end testing

Architecture & DevOps assessment

In this part, the focus was on platform deployment and administration. We created an isolated AWS account, identical to the main account we use for our daily tasks, and then went ahead with the deployment of Databricks, which we found straightforward. The tests were evaluated against the following categories:

Deployment: how easy it is to deploy Databricks within AWS.

Administration: what features are available to the platform admin, and how effective they are.

Tools & Features: whether the available tools can cover all our daily tasks.

Performance: query performance and job performance (see the sketch after this list).

Integration: how well the platform integrates with external services.
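
As an illustration of the performance category, the sketch below shows a naive query-timing check of the kind one might run on either platform. The table and query are hypothetical, and a real benchmark would repeat runs and control for caching.

```python
# Minimal sketch of a naive query-timing check (hypothetical table and query).
# A proper benchmark would repeat runs, warm or clear caches, and vary data sizes.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start = time.perf_counter()
# .count() is an action, so it forces full execution of the aggregation.
n_groups = spark.sql(
    "SELECT product, COUNT(*) AS n FROM insurance.quotes GROUP BY product"
).count()
elapsed = time.perf_counter() - start

print(f"Aggregated into {n_groups} groups in {elapsed:.2f}s")
```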

End-to-end testing

This is where each solution was tested in much finer detail, by developing and productionising a machine learning model with both Databricks and SageMaker.

Since Databricks casts a wider net than machine learning applications alone, we arranged a two-day hackathon that involved various teams within the Data Function, who worked through a scripted task list predefined by representatives of each team (Insights, Analytics, Data Science, etc.).

This part was evaluated using a scorecard that rolled up into various categories such as:


Productivity & Workspace: ease of use, platform performance, and stability of the environment.

Collaboration: working with other users, and sharing results and dashboards.

Analytics: data manipulation, visualisation, and data export.

Data Science: machine learning lifecycle management (see the sketch below).

Note: the list above is not exhaustive, just a high-level overview.
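
To make the machine learning lifecycle item above concrete, here is a minimal experiment-tracking sketch with MLflow, which ships with Databricks. The dataset, model, and metric are illustrative only, not our actual workload.

```python
# Minimal experiment-tracking sketch with MLflow (bundled with Databricks).
# The dataset, model, and metric are illustrative, not our actual workload.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    # Log parameters, metrics, and the model artifact for later comparison.
    mlflow.log_param("max_iter", 1_000)
    mlflow.log_metric("accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
```

Runs tracked like this can then be compared across experiments, which is the kind of lifecycle capability the scorecard assessed.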

Each team member was expected to score the various tasks under each category. The scores were then discussed, to understand the reasoning behind them, and averaged where relevant, to get an idea of which option the team preferred: Databricks, SageMaker, or the current way of working.
