The importance of data quality in AI
Before deploying AI models to solve business problems, it’s vital to take a step back and assess the quality of available data.
It is undeniable that artificial intelligence (AI) has permeated every aspect of our modern, digital world, from entertainment and manufacturing to security and even healthcare.
The AI models that drive intelligent machines rely heavily on data, and for the results to be accurate and reliable, the data being consumed needs to be of high quality.
As such, before deploying AI models to solve business problems, it is important for organisations to take a step back and assess the quality of their available data.
There are several critical reasons why data quality is so important in AI. These include:
Accurate results: High-quality data helps organisations derive more accurate insights for better decision-making and unlock innovation through AI and machine learning. This, in turn, enables better user experiences and opens new income streams through the launch of new products and services.
Unbiased training data: In a multicultural environment, the datasets that AI models are trained on can be biased in ways that perpetuate existing prejudices and stereotypes. It is important to identify these biases and correct them by ensuring the training data is representative of the population being modelled.
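One simple way to make training data more representative is to upsample under-represented groups until their shares match known population shares. The sketch below illustrates this idea; the field names and target shares are assumptions for the example, and in practice the targets would come from census or population statistics.

```python
import random
from collections import Counter


def rebalance(records, group_key, target_shares, seed=0):
    """Upsample under-represented groups so the training set's
    group shares approach the target population shares.

    `target_shares` maps group -> desired fraction (assumed known,
    e.g. from population statistics)."""
    rng = random.Random(seed)
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[group_key], []).append(rec)
    # Size the rebalanced set from the group that implies the largest total
    total = max(len(rows) / target_shares[g] for g, rows in by_group.items())
    out = []
    for g, rows in by_group.items():
        want = round(total * target_shares[g])
        out.extend(rows)
        # Duplicate randomly chosen rows to make up any shortfall
        out.extend(rng.choice(rows) for _ in range(max(0, want - len(rows))))
    rng.shuffle(out)
    return out
```

Duplicating rows is the crudest form of rebalancing; collecting more data from under-represented groups is always preferable where feasible.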
Trust in the data: AI projects are more likely to succeed if the stakeholders trust the data they are working with. When users trust the data in general, they will also trust insights from the AI models that use that data, further entrenching AI in the business.
Data quality challenges
There are some common challenges that exist when it comes to data and its quality, and organisations need to be aware of these and manage them as best as possible. These include:
Lack of data quality controls: Data quality issues are often identified by users late, at the reporting layer, which is a source of frustration, as missing or incorrect information may render a report useless. The root cause usually lies in data sources or systems further upstream, yet the source owners are often oblivious to the impact the missing data has on downstream systems.
Constantly changing source systems: As organisations evolve and grow, source systems are constantly changing. This means data quality measures already in place need to be aligned with the new source systems. This process can take time while stakeholders familiarise themselves with the new source systems, and the magnitude of the data quality challenges may only be quantified much later, once the impact on downstream systems is known.
Challenges of new data types: Big data analytics technologies have introduced new capabilities, making it possible to store and process data formats such as unstructured text, images and videos. This adds another layer of complexity, as this kind of data tends to come in massive datasets that are difficult, and often impossible, to assess manually, requiring specialised tools to assess quality.
Poor data architecture: Modern data tools will not deliver anticipated benefits if the underlying data architecture is not performant and robust enough. As more sources get integrated and data volumes grow, a badly designed environment may fail to perform even a basic task, such as maintaining consistent data ingestion to ensure timeous data availability to downstream systems. This may lead to incomplete datasets, or even failures for some business processes, which may lead to general mistrust of data by business users.
What can be done to improve data quality?
Some of the areas organisations can focus on to improve data quality include:
Data literacy: Data literacy campaigns sponsored by senior management can help entrench an organisation-wide data culture. By fostering a data culture, organisations create an environment where all employees understand the impact of data quality on the organisation's overall performance. Line managers' KPAs can be linked to the level of data quality delivered by their teams; for example, a call centre manager may need to ensure contact information such as addresses and phone numbers is confirmed at every customer interaction and updated where necessary. Staff also need to understand the importance of capturing as many data points as possible, including those that are not mandatory; even non-mandatory fields can provide important insights in some contexts.
Data capture: Data validation and auto-populating forms with information from trusted sources, such as home affairs, SARS, internal HR systems and banks, can help improve data quality by reducing manual data capture errors. It also helps ensure most information is captured, leaving the agent to focus on the fields that are not auto-populated.
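The capture pattern described above, validate the input, pull what you can from a trusted source, and let manual entry fill only the gaps, can be sketched as follows. The lookup table and field names here are hypothetical stand-ins for a real verified source such as an internal HR system or a bank API.

```python
import re

# Hypothetical trusted lookup; in practice this would query a
# verified source (e.g. home affairs, an HR system or a bank).
TRUSTED_PROFILES = {
    "8001015009087": {"name": "T. Dube", "city": "Johannesburg"},
}


def capture_customer(id_number, manual_fields):
    """Validate the ID, auto-populate from the trusted source and
    let manually captured fields fill only what is still missing."""
    if not re.fullmatch(r"\d{13}", id_number):
        raise ValueError("ID number must be 13 digits")
    record = dict(TRUSTED_PROFILES.get(id_number, {}))
    # Manual input only fills gaps; it never overwrites trusted data
    for field, value in manual_fields.items():
        record.setdefault(field, value)
    record["id_number"] = id_number
    return record
```

Because trusted values take precedence, a typo made by the agent in an auto-populated field simply never reaches the record.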
Intuitive input screens: Innovatively designed input interfaces that present only the relevant fields, based, for example, on the type of customer, rather than a one-size-fits-all form, can make it easier to capture data accurately.
Data quality tools: In the age of big data, the sheer volume of data and the variety of formats it comes in, both structured and unstructured, mean manual data quality checks are no longer feasible. Most data analytics vendors, such as Informatica, SAS and IBM, have tools specifically designed for data quality. These tools can highlight missing data or data anomalies so they can be proactively corrected.
Dedicated data quality team: Data quality improvement is a continuous process, and it is important that this function is staffed with dedicated data quality specialists equipped with the right tools. Having a data quality team ensures there is a team dedicated solely to improving data quality with clear and well-defined KPIs.
Data quality control framework: Lastly, organisations need to implement a data quality control framework to assess, measure and track improvements in the quality of their data. The framework should seek to monitor and address all five data quality elements: accuracy, completeness, reliability, relevance and timeliness. All players in the data value chain − including business users, data stewards, IT data teams and source system owners − need to be involved to ensure the framework is implemented.
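A framework that tracks the five elements needs measurable scores to report against. The sketch below scores two of them, completeness and timeliness, per dataset; the field names, freshness window and scoring formulas are assumptions chosen for the example, not a standard.

```python
from datetime import datetime, timedelta


def score_dataset(rows, required_fields, ts_field, max_age, now=None):
    """Score completeness (share of required fields populated) and
    timeliness (share of records fresher than `max_age`), two of
    the five data quality elements a framework would track."""
    now = now or datetime.now()
    filled = total = fresh = 0
    for row in rows:
        for field in required_fields:
            total += 1
            if row.get(field) not in (None, ""):
                filled += 1
        if now - row[ts_field] <= max_age:
            fresh += 1
    return {
        "completeness": filled / total if total else 1.0,
        "timeliness": fresh / len(rows) if rows else 1.0,
    }
```

Publishing such scores per source system over time gives every player in the data value chain, from source owners to business users, the same view of whether quality is actually improving.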
As organisations modernise their data estate, they should, at the same time, pay particular attention to the quality of their data. One of the motivations for data modernisation is for organisations to benefit from tools that come with a modern data platform, such as "AI as a service".
However, the benefits of AI and machine learning cannot be fully realised if the quality of the data the models are trained on, and eventually consume, is poor.
Director, PBT Innovation, PBT Group.
Dube is a data engineering consultant with local and international experience spanning telco and broadcast media industries, and large-scale greenfield data warehouse projects.
He holds a BSc degree with computer science and mathematics majors from the University of the Witwatersrand. He also has scientific computing, Python, machine learning and statistical analysis certificates from WorldQuant University and is an AWS certified cloud practitioner.