The current state of data

By Julian Thomas, Principal consultant at PBT Group

Johannesburg, 08 Nov 2019

As I sit here thinking about the current state of data, I can’t help but feel quite chuffed that it is going rather well.

A constantly increasing number of positions are being advertised for data scientists and engineers. This implies a constantly growing focus on the importance of data, and the disciplines of data science and engineering.

Sadly, the cold light of reality eventually dawns on me, and I must acknowledge that these are peripheral changes ‒ indicating only that there is an increase in hiring behaviour. What I am, unfortunately, not seeing much of is an increasing maturity in the process of data science, nor an increase in the synergy and co-operation between data science and IT.

Let me elaborate a bit further. Leading research firms estimate that as much as 85% of big data and analytics projects fail to deliver. Lack of suitable skills certainly plays a part in this scary statistic. Another reason is that many of the resulting models are just not implementable, from either a business or a technological perspective.

For example, a predictive model might give the business insight, such as which customers will pay off their loan early; however, the business might not be able to implement a counter-measure. How does the business prevent a customer from settling their loan early? This example illustrates the lack of thorough evaluation ‒ does the cost or impact of the solution outweigh the cost of the problem? Is a solution even possible, or practical?
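The two questions above can be put as a rough check before any modelling begins. This is only a sketch: the function names, the `can_act` flag and the figures are hypothetical, chosen purely to illustrate the reasoning.

```python
def net_value(problem_cost, solution_cost, share_of_problem_avoided):
    """Value recovered by acting on the model, minus the cost of acting."""
    return problem_cost * share_of_problem_avoided - solution_cost


def worth_modelling(problem_cost, solution_cost, share_of_problem_avoided, can_act):
    """A use case qualifies only if a counter-measure exists and it pays for itself."""
    if not can_act:
        # Insight without a possible action, e.g. early loan settlement.
        return False
    return net_value(problem_cost, solution_cost, share_of_problem_avoided) > 0


# Hypothetical figures, for illustration only.
print(worth_modelling(1_000_000, 200_000, 0.5, can_act=True))
print(worth_modelling(1_000_000, 200_000, 0.5, can_act=False))
```

The point of the `can_act` flag is that a use case with no practical counter-measure fails the check no matter how accurate the model is.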

From a technology perspective, we often find practical challenges in implementation. For example, a model might have been built on static data, in a batch environment. In real life, the data becomes available in real-time, and at the point of entry into the business process, the company might not have all the data available that the model was built on.

As a result, the business may end up with a perfectly valid model, but one that cannot be implemented. It might also simply not have the technical capability to implement the model within the solution in question, such as when it has purchased a solution but doesn’t have access to the underlying code.
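The mismatch between the data a model was built on and the data available at the point of entry can be checked before any modelling starts. A minimal sketch, assuming the team can list both sets of fields; the feature names below are hypothetical.

```python
def missing_at_scoring_time(model_features, available_at_entry):
    """Return the model features that do not yet exist when a record enters the process."""
    return sorted(set(model_features) - set(available_at_entry))


# Hypothetical feature names, for illustration only.
MODEL_FEATURES = ["loan_amount", "term_months", "avg_balance_90d", "missed_payments_12m"]
AVAILABLE_AT_ENTRY = ["loan_amount", "term_months"]

gaps = missing_at_scoring_time(MODEL_FEATURES, AVAILABLE_AT_ENTRY)
if gaps:
    print("Model cannot be scored in real time; missing:", ", ".join(gaps))
else:
    print("All model features are available at the point of entry.")
```

A non-empty result is an early warning that the model, however valid, will not be implementable as built.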

In many cases, this speaks to the maturity of the data science initiative within the business. Is the company hiring an individual to do some stats, or is it putting together a professional team, whose focus is not to build statistical models, but rather to implement operational solutions? This suggests a totally different maturity level, as the team, from the ground up, is being created for success.

What I am saying is: don’t hire a data scientist simply to first experiment with the data. Rather, a business must plan as though it is going to succeed. Put the people and structures in place from the ground up, to ensure modelling is only done on use cases that have valid profit and implementation models.

Most importantly, the business must run the initiative on the basis that it will be taking something to production. This implies setting up the capability to operationalise the solution upfront, which implies formal solution development and implementation, an IT function, rather than a data science skill set.

This all leads me to the lack of synergy between data science teams and IT. Sadly, I find IT to still be quite restrictive, taking much more of a governance approach, and not enough of a co-operative solution enablement approach. It is easy to say “No”, and certainly more challenging to say, “How can I help you?”

In IT’s defence, this is often a result of how IT is managed with respect to its service level agreements and budget. As such, if the business wants greater participation from IT, it must also enable change that will encourage and allow IT to be more flexible in how it assists the business.

What we need to see is an increasing conversation around how IT can supply data assets and services to data science teams. We need to see IT supplying not just raw data, but data assets. Think of a series of pillars; these should be the vertical sources supplied in some form of data-lake-style repository. At this point, this is just source data. Now, extend the height of each pillar by adding functionality on top of it.

This functionality includes standardisation, normalisation, imputation, feature engineering and feature optimisation. Research suggests getting to this point comprises as much as 80% of the data science initiative – data engineering enables data science.
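To make two of these steps concrete, here is a minimal sketch of imputation followed by standardisation on a single column, using only the Python standard library. Mean imputation and z-scores are assumptions chosen for illustration, as are the function names and values; a real data asset would choose methods per field.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical source column with one missing value.
raw_income = [4000.0, None, 6000.0, 5000.0]
clean = standardise(impute_mean(raw_income))
print(clean)
```

Supplying columns in this prepared form, rather than raw, is the kind of functionality that sits on top of each pillar.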

Next, do the work to connect the pillars, so that the data assets are easily combined. Allow the data scientists to freely explore this data with their tools of choice. Lastly, supply the dedicated capability to operationalise the resulting models built by the data science teams. Only once this becomes the norm can we start patting ourselves on the back.

Let me conclude this Industry Insight with something fun, and downright silly. Imagine this ridiculous analogy – the approach needs to resemble a game of curling.

The business is the initial player that gently releases the carefully selected curling stone onto the ice. The data science teams are the stones that need to glide effortlessly across the ice.

The ice represents all the challenges the data science team will encounter on their journey. The two players with the brooms, vigorously brushing the ice in front of the stone, represent the IT function that has finally lost its inhibitions, and is doing everything in its power to heat up the ice, in order to allow the curling stone to travel as far as it can possibly go.
