Analytics’ value is directly dependent on data quality

Johannesburg, 20 Apr 2020

Matthew Bernath, head of data analytics, RMB

The proliferation of big data from a multitude of structured and unstructured sources makes it easy to gather vast quantities of information for use in advanced analytics, predictive marketing and robotic process automation. The challenge, of course, is that it is not always easy to determine if there are inaccuracies in such high-speed and high-volume data – and if the data that drives what you are doing is bad, then even the most sophisticated tools are essentially worthless.

This is the view of Matthew Bernath, head of data analytics at RMB, who spoke at a recent informal ‘meet-up session’ hosted by SAS and designed to offer both theory and practical knowledge in the field of data science. According to SAS, such events encourage data science students, full-time data analysts and corporate executives to come together to learn about the latest trends in this space and to improve their own skills.

Bernath points to real-world examples where serious incidents were initially blamed on a faulty AI, but turned out to have been caused by bad information being fed to it.

“Although such examples are on the extreme end of the scale, they are a good demonstration of what can happen when bad data is unknowingly used – the principle of ‘garbage in, garbage out’ always applies. If you build your model using bad data, it is inevitable that the model will not perform as expected.”

Bernath points out that this presents a challenge to data scientists, since the whole point of AI is to rapidly obtain the kind of answers that a human being could not arrive at. However, when untrustworthy data is fed to the model, it negates the point of using AI – this, he says, is something referred to as ‘automation surprise’.

“It is also important to understand that there are many reasons for data to be inaccurate. Bias can play a major role here and can creep into analytics, especially when using historical data to train models.

“It is good advice to data scientists to always be cognisant of training their model properly, and to also be aware that in such situations, more is always better. If there is simply not enough information available for the algorithm to learn from, it will also ultimately fail.”

Of course, he adds, identifying that data may be biased is no easy task either. He points to the match between Go world champion Lee Sedol and the AlphaGo AI, in which both Lee and observers were totally confused by a move the machine made. In the 37th move of the second game, AlphaGo did something so unusual that, at first, Go experts commenting on the match assumed the person responsible for physically placing AlphaGo’s stones on the board had made a mistake.

“This means that although there are huge opportunities for algorithms to make better decisions than humans, it remains our prerogative as data scientists to ensure there is always sufficient governance around all input data and that we clearly understand the workings of the model being used.

“To achieve this, you will need a fundamental understanding of data science tools and principles when building your models, to ensure that the correct toolset for the data is used and that the information is valid and clean. This is no easy task, but as analysts in a cutting-edge discipline, it is our job to identify unusual decisions made by AI as quickly as possible, in order to determine whether this is as a result of the machine ‘thinking’ far ahead of the human brain, or due merely to bad data. There is no doubt that this makes data science more challenging, but it is a challenge we must rise to,” he concludes.
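As an illustration of what checking that information is "valid and clean" might involve before a model is trained, the following is a minimal Python sketch. It is not taken from Bernath's talk or from any SAS tooling; the function name `audit_rows`, the field names and the value ranges are all hypothetical, chosen only to show the idea of flagging missing values, implausible values and duplicate records.

```python
from math import isfinite


def audit_rows(rows, required, ranges):
    """Return (row_index, problem) tuples for records that look suspect.

    rows     -- list of dicts, one per record (hypothetical input format)
    required -- field names that must be present and not None
    ranges   -- {field: (low, high)} plausible bounds for numeric fields
    """
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Flag required fields that are absent or empty.
        for field in required:
            if row.get(field) is None:
                problems.append((i, f"missing {field}"))
        # Flag numeric values that are non-finite or outside plausible bounds.
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is None:
                continue  # already reported above if the field is required
            if (not isinstance(value, (int, float))
                    or not isfinite(value)
                    or not low <= value <= high):
                problems.append((i, f"{field} out of range"))
        # Flag exact duplicates of an earlier record.
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((i, "duplicate row"))
        seen.add(key)
    return problems


# Illustrative records only; these are made-up values, not real data.
rows = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 212, "income": 51000.0},
    {"age": 34, "income": 52000.0},
]
issues = audit_rows(
    rows,
    required=["age", "income"],
    ranges={"age": (0, 120), "income": (0, 1e7)},
)
for index, problem in issues:
    print(index, problem)
```

Checks like these only catch obvious defects. They would not, on their own, reveal the kind of historical bias Bernath describes, which generally requires examining how the data was collected.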

The best analytics tools and most competent data scientists in the world are not necessarily enough to overcome the problems that arise when bad data is utilised in an otherwise good system.