Unlocking the full potential of data science through scalability

Scaling up data science: how to drive business value and gain a competitive edge.

Johannesburg, 10 Feb 2023

The past few decades have seen explosive growth in data, AI and machine learning technologies. The field has come a long way since the earliest working AI programs of the 1950s, which ran on minuscule compute power compared with the standard smartphone of today. Those algorithms were innovative and set the scene for the decades of development and evolution that followed, but they couldn't unlock the scale of business value we see in companies such as Microsoft, Amazon and Tesla today. The question top of mind for business leaders is: how do I extract value from this thing called AI and machine learning?

We can compare Data Science @ Scale to the retail industry: because margins on most retail products are slim, companies recognise that the key to sustainability is selling at high volume. In the same way, data science, where AI and ML algorithms are used to solve a business problem, becomes truly valuable to a business only when it can be executed at scale.

What is Data Science @ Scale?

Let's begin by defining what data science is: a data scientist operates at the intersection of mathematics, domain expertise, and computer engineering, and must be able to articulate complex problems, and the business value of solving them, in understandable terms.

Scaling data science implies that every business decision is powered in some way by analytics and predictive modelling. Billions of business decisions are made every year: according to TransUnion, roughly 8 million South Africans were looking to apply for some form of credit in 2022. The point of application alone therefore requires at least 8 million credit decisions to be made. In addition, decisions across the rest of the credit lifecycle, such as behavioural scoring and bad-debt management, must also be made.

The National Payments System in South Africa processes roughly R600 billion every single day. The number of transactions driving that amount is staggering, and one of the biggest use cases of predictive analytics across financial services institutions today is fraud detection on these transactions.

Google gives us a good example of executing data science at scale: research shows that in 2020 alone the company ran about 600 000 experiments on changes to its search algorithms and implemented roughly 4 500 improvements in the same year, equating to more than 12 improvements per day.

Running data science at scale and extracting value at scale hinges on having the necessary setup in place. While discussions about machine learning algorithms dominate, there are several other crucial aspects to consider, such as feature engineering, prediction serving infrastructure, and monitoring the relevance of predictions.

What companies need to consider in order to execute Data Science @ Scale

Flexible and Scalable Compute:

Are you leveraging the latest in infrastructure, such as Azure cloud services, to access scalable compute? Cloud services have enabled a flexibility in allocating the required resources that was previously unseen: clusters can scale up when workloads peak and back down when idle.
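As an illustration, here is a minimal sketch of provisioning an autoscaling compute cluster with the Azure Machine Learning Python SDK v2 (azure-ai-ml); the subscription, resource group, workspace and cluster names are placeholders, and the VM size and scaling limits are illustrative assumptions rather than recommendations.

```python
# A minimal sketch, assuming the Azure Machine Learning Python SDK v2
# (azure-ai-ml); subscription, resource group, workspace and cluster
# names are placeholders, and the VM size and scaling limits are
# illustrative choices, not recommendations.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# An autoscaling cluster: scales down to zero nodes when idle and
# out to four nodes when training workloads are submitted.
cluster = AmlCompute(
    name="cpu-cluster",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=1800,  # seconds before idle nodes are released
)
ml_client.compute.begin_create_or_update(cluster).result()
```

Because the minimum instance count is zero, the cluster incurs no compute cost while idle and grows automatically as jobs are submitted.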

Endpoints for Fast Operationalisation of Results:

Being able to build models on compute that returns results quickly is key. Once the models have been trained, though, making their outputs consumable by other operational systems and applications is critical. API-first thinking, and the broad adoption of APIs, has simplified the process of exposing machine learning models as endpoints that can be consumed as REST APIs. Moving towards an integration layer that allows this type of rapid deployment will drive delivery at scale.
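As a sketch of what this can look like, the example below exposes a hypothetical, pre-trained credit-scoring model as a REST endpoint using FastAPI; the framework choice, the model file name and the input fields are all assumptions for illustration.

```python
# A minimal sketch, assuming FastAPI and a hypothetical scikit-learn
# classifier saved as model.joblib; the input fields are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained classifier

class CreditApplication(BaseModel):
    income: float
    existing_debt: float
    months_at_employer: int

@app.post("/score")
def score(application: CreditApplication) -> dict:
    # Arrange the payload into the feature order the model was trained on.
    features = [[
        application.income,
        application.existing_debt,
        application.months_at_employer,
    ]]
    probability = float(model.predict_proba(features)[0][1])
    return {"default_probability": probability}
```

A service like this can be started with, for example, uvicorn (uvicorn app:app, assuming the file is saved as app.py) and then called over HTTP by any operational system that needs a score.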

Identity and Access Management:

Data governance is topical in data science for a reason: anyone working in this discipline will routinely handle sensitive data such as personally identifiable information. Managing access to that information properly must be a priority.

Collaboration across different tools:

Traditional analytical toolsets were set up in a way that made collaborating across different tools difficult, if not impossible. Today, a notebook environment easily allows a user to write and run code in different languages within the same workspace. Sharing across teams with tools such as Azure Databricks, Azure ML and Azure Synapse is seamless and removes the need to send code to one another, an approach in which versions are never aligned and discrepancies are common. That is why choosing a toolset that aids collaboration is key; it also allows individuals to brainstorm quickly and improve on each other's implementations.
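As a simple illustration of mixing languages in one workflow, the sketch below runs a SQL step from Python with PySpark; the table and column names are made up, and in a Databricks notebook the SQL step could equally sit in its own cell under the %sql magic command.

```python
# A minimal sketch, assuming PySpark and a table named `transactions`
# (the table and column names are made up). In a Databricks notebook
# the SQL step could instead live in its own %sql cell, with the
# result picked up again from Python.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in Databricks, `spark` already exists

# Express the aggregation in SQL...
daily_totals = spark.sql("""
    SELECT transaction_date, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY transaction_date
""")

# ...and carry on working with the result in Python.
daily_totals.show()
```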

Logging and Tracking of Experiments:

Data science is, at its core, the application of scientific techniques to data, and part of that is running numerous experiments, as the Google example shows. Just as important as being able to run many experiments at scale, however, is the ability to log and track them. Experience shows that teams often need to return to an experiment run months or years earlier, and finding those results can be difficult. Being able to quickly reference every previous run of a model, together with metadata about the run such as accuracy metrics, is key to reproducing and repeating results.
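One widely used approach is an experiment tracker such as MLflow. The sketch below logs the parameters, a test metric and the trained model for a single run; the experiment name, dataset and model choice are illustrative.

```python
# A minimal sketch, assuming MLflow and scikit-learn; the experiment
# name, dataset and model are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("credit-scoring")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 6}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    # Parameters, metrics and the model artefact are all stored against
    # this run, so the result can be found and reproduced later.
    mlflow.log_params(params)
    mlflow.log_metric("test_auc", float(auc))
    mlflow.sklearn.log_model(model, "model")
```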

Continuous Improvement, Development and Delivery:

Similar to the principles of DevOps, MLOps provides a framework for the continuous delivery of improvements. Implementing CI/CD for machine learning, along with the right people, processes and technology, can lead to significant value creation: faster delivery, better-governed releases and proper testing all depend on a framework such as MLOps to ensure continuous improvement.
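A common building block in such a pipeline is a quality gate that decides whether a newly trained model may replace the one in production. The sketch below shows the idea in plain Python; the metric, thresholds and hard-coded values are assumptions, and in practice they would be read from an experiment tracker or model registry.

```python
# A minimal sketch of a promotion gate; the metric, thresholds and
# hard-coded values are assumptions. In practice the candidate and
# production figures would come from an experiment tracker or model
# registry rather than being typed in.
import sys


def should_promote(candidate_auc: float, production_auc: float,
                   min_auc: float = 0.75, min_uplift: float = 0.002) -> bool:
    """Promote only if the candidate clears an absolute floor and
    meaningfully improves on the model currently in production."""
    return candidate_auc >= min_auc and (candidate_auc - production_auc) >= min_uplift


if __name__ == "__main__":
    candidate_auc = 0.781   # e.g. read from the latest training run
    production_auc = 0.774  # e.g. read from the model registry
    if should_promote(candidate_auc, production_auc):
        print("Candidate passes the gate: proceed to the deployment stage.")
    else:
        print("Candidate rejected: keep the production model.")
        sys.exit(1)  # fail the pipeline so no release is made
```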

Well Governed and Managed Feature Store:

Machine learning models are only as good as the data fed into them, and more accurate models often materialise only when new types of data are brought into the model-building process. Treating the features provided to models like well-managed IT assets can be very beneficial. As with any IT asset, managing a feature's lifecycle, aligning it to the business strategy, developing and implementing it, improving it, and deprecating it once it is no longer useful, will create efficiencies and unlock value. A curated feature store provides a catalogue that gives visibility of the features available, processing infrastructure to manipulate big datasets, and serving infrastructure so the data can be consumed for model building and by production systems. Additionally, treating features as data products, each assigned an owner who is accountable for usage, quality and usability, will drive the benefits to a new level.
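To make the idea of features as data products concrete, the sketch below defines a minimal catalogue entry with an owner and a lifecycle stage; in practice a managed feature store platform would provide the catalogue, processing and serving layers, and all names here are illustrative.

```python
# A minimal sketch of a feature catalogue entry; in practice a managed
# feature store platform would provide the catalogue, processing and
# serving layers, and all names below are illustrative.
from dataclasses import dataclass
from enum import Enum


class LifecycleStage(Enum):
    PROPOSED = "proposed"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


@dataclass
class FeatureDefinition:
    name: str              # unique name used by models and pipelines
    description: str       # business meaning, not just the calculation
    owner: str             # data product owner accountable for quality and usage
    source_table: str      # where the underlying data comes from
    stage: LifecycleStage  # aligned to the lifecycle described above


catalogue = [
    FeatureDefinition(
        name="avg_monthly_spend_6m",
        description="Average card spend over the trailing six months",
        owner="retail-credit-analytics",
        source_table="curated.card_transactions",
        stage=LifecycleStage.ACTIVE,
    ),
]

# A simple catalogue view gives teams visibility of what already exists
# before they engineer the same feature again.
for feature in catalogue:
    print(f"{feature.name} ({feature.stage.value}) - owner: {feature.owner}")
```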

Maximising the potential of data can undoubtedly unlock business value and unveil new opportunities. Numerous examples in the literature show how businesses, both established and emerging, that transformed digitally ahead of their peers outperformed their competitors financially, while some well-known brands were relegated to mere memories. Research has consistently shown that companies that adopt a data-driven approach to decision making tend to achieve greater long-term profitability. Data science opens the door to using predictive and prescriptive analytics to drive decisions, but the benefit is often only unlocked when these concepts are driven at very large scale. Our capabilities in running machine learning operations at scale have helped companies take their strategic objectives to a new level, and can do the same for yours.
