Today's digital business world is eager to leverage artificial intelligence (AI) for the benefits it promises: greater efficiency, lower operational costs and more.
However, what can be overlooked is that AI is founded on data, and all too often that data has not been appropriately readied for the job. This underscores the need to prioritise data readiness as much as AI adoption itself.
Data that is siloed, inconsistent, outdated, or poorly structured will simply not cut it. If AI systems are to deliver accurate insights or actionable outcomes, they need clean, integrated and well-governed data, otherwise AI initiatives risk falling short.
According to Gartner, by 2026, more than 80% of enterprises will have used generative artificial intelligence (GenAI) application programming interfaces or models, and/or deployed GenAI-enabled applications in production environments. This contrasts with the 2023 estimate that calculated uptake at less than 5%.
Brave predictions, maybe, but the fact remains that AI cannot succeed without good, trusted data.
Emerging AI technologies such as agentic AI and GenAI can create strikingly realistic content that has the potential to enhance productivity in virtually every aspect of business. But if data is to be used for AI, it must be high-quality and precisely prepared for these intelligent applications.
This means spending many hours manually cleaning and enhancing the data to ensure accuracy and completeness, and organising it in a way that machines can easily understand. This data also often requires extra information − like definitions and labels − that enriches its semantic meaning for automated learning and helps AI perform tasks more effectively.
The bottom line is the sooner data can be prepared for downstream AI processes, the greater the benefit. There are six principles for ensuring data is ready for use with AI. These are that data must be: diverse, timely, accurate, secure, discoverable and easily consumable by machines.
Let's delve into each of these points in detail.
Diverse data means not building AI models on narrow and siloed datasets. Quite the opposite: draw from a wide range of data sources spanning different patterns, perspectives, variations and scenarios relevant to the problem domain.
This data could be well-structured and live in the cloud or on-premises. It could also exist on a mainframe, database, SAP system, or software-as-a-service application. Conversely, the source data could be unstructured and live as files or documents on a corporate drive.
Let me expand on the potential human impact of bias in AI systems − also known as machine learning or algorithm bias. This occurs when AI applications produce results reflecting human biases, such as social inequality. This can happen when the algorithm development process includes prejudicial assumptions or, more commonly, when the training data has bias.
For example, a credit scoring algorithm may unfairly deny loans if it consistently relies on a narrow band of financial attributes. It's essential to acquire data in various forms from a wide range of sources for integration into machine learning (ML) and GenAI applications: greater data diversity reduces bias and helps prevent AI applications from delivering unfair decisions.
This brings us to the matter of timely data. While ML and GenAI applications work best on diverse data, the freshness of that data is also key. AI models trained on outdated information can produce inaccurate or irrelevant results.
Fresh data allows AI models to stay current with trends, adapt to changing circumstances and deliver the best possible outcomes. To ensure timely data, it's essential to build and deploy low-latency, real-time data pipelines for AI initiatives.
Change data capture is often used to deliver timely data from relational database systems, and stream capture is used for data originating from internet of things devices that require low-latency processing. Once the data is captured, target repositories are updated and changes are continuously applied in near-real-time to produce the freshest possible data.
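To illustrate the idea, here is a minimal Python sketch of applying captured change events to a target store so it stays fresh; the ChangeEvent structure, field names and customer records are hypothetical, not any specific CDC tool's API.

from dataclasses import dataclass
from typing import Iterable

@dataclass
class ChangeEvent:
    op: str    # "insert", "update" or "delete"
    key: str   # primary key of the changed record
    row: dict  # new column values (empty for deletes)

def apply_changes(events: Iterable[ChangeEvent], target: dict) -> dict:
    # Continuously apply captured changes so the target mirrors the source.
    for event in events:
        if event.op == "delete":
            target.pop(event.key, None)
        else:  # inserts and updates both become upserts
            target[event.key] = event.row
    return target

# Three captured changes keep a small customer table fresh.
events = [
    ChangeEvent("insert", "cust-1", {"name": "Ada", "segment": "retail"}),
    ChangeEvent("update", "cust-1", {"name": "Ada", "segment": "premium"}),
    ChangeEvent("delete", "cust-2", {}),
]
print(apply_changes(events, {"cust-2": {"name": "Bob", "segment": "retail"}}))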
The success of any ML or GenAI initiative hinges on one key ingredient: accuracy. A sponge is a useful analogy here: AI models act like sophisticated sponges, soaking up information to learn and perform tasks.
However, if the information is inaccurate, it's like the sponge is soaking up dirty water, leading to biased outputs, nonsensical creations, and, ultimately, a malfunctioning AI system. Therefore, data accuracy is a fundamental tenet for building reliable and trustworthy AI applications.
Data accuracy has three aspects to it, the first of which is profiling source data to understand its characteristics, completeness, distribution, redundancy and shape. Profiling is also commonly known as exploratory data analysis.
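As a simple illustration of profiling, the pandas sketch below summarises completeness, redundancy, distribution and shape for a small, made-up dataframe; the columns and values are purely illustrative.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, None, 29, 29],
    "country": ["ZA", "ZA", "UK", None],
})

profile = {
    "shape": df.shape,                            # rows and columns
    "dtypes": df.dtypes.to_dict(),                # column types
    "missing_ratio": df.isna().mean().to_dict(),  # completeness per column
    "duplicate_rows": int(df.duplicated().sum()), # redundancy
    "numeric_summary": df.describe().to_dict(),   # distribution and shape
}
print(profile)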
The second aspect is operationalising remediation strategies by building, deploying and continually monitoring the efficacy of data quality rules. Data stewards may need to be involved here to aid with data deduplication and merging. Alternatively, AI can help automate and accelerate the process through machine-recommended data quality suggestions.
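To make that concrete, here is a minimal sketch of data quality rules expressed as checks whose pass rates can be monitored over time; the rule names and columns are assumptions for illustration, not a particular product's syntax.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

rules = {
    "customer_id must be unique": lambda d: ~d["customer_id"].duplicated(keep=False),
    "email must be present": lambda d: d["email"].notna(),
    "email must contain @": lambda d: d["email"].str.contains("@", na=False),
}

# Evaluate each rule and report its pass rate so efficacy can be tracked.
for name, rule in rules.items():
    passed = rule(df)
    print(f"{name}: {passed.mean():.0%} pass, {int((~passed).sum())} violations")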
The final aspect is enabling data lineage and impact analysis − giving data engineers and scientists tools that highlight the impact of potential data changes and trace the origin of data, preventing accidental modification of the data that AI models rely on.
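As a rough illustration of impact analysis, the sketch below models lineage as a simple dependency graph and works out which downstream datasets a change would touch; the dataset names are hypothetical.

# Each dataset maps to the upstream sources it is built from.
lineage = {
    "crm_extract": [],
    "web_events": [],
    "customer_features": ["crm_extract", "web_events"],
    "churn_training_set": ["customer_features"],
}

def downstream_impact(changed: str, graph: dict) -> set:
    # Return every dataset that depends, directly or indirectly, on `changed`.
    impacted, frontier = set(), [changed]
    while frontier:
        current = frontier.pop()
        for dataset, sources in graph.items():
            if current in sources and dataset not in impacted:
                impacted.add(dataset)
                frontier.append(dataset)
    return impacted

# A change to the CRM extract affects the feature table and the training set.
print(downstream_impact("crm_extract", lineage))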
High-quality, accurate data ensures models can identify relevant patterns and relationships, leading to more precise decisions, generation and predictions.
In my next article, I will explore the remaining principles crucial to getting data AI-ready.