
The security side of getting data AI-ready

Three tactics, working together, can help companies to automate data security at scale, as good metadata practices drive discoverability.
By Louis De Gouveia, Data competency manager at iOCO.
Johannesburg, 30 May 2025

In my previous article, I covered the principles crucial to getting data AI-ready − namely, that data must be diverse, timely, accurate, secure, discoverable and easily consumable by machines. Here I expand on the remaining principles, starting with the all-important issue of security.

Artificial intelligence (AI) systems often use sensitive data − including personally identifiable information, financial records and proprietary business information − and working with this data carries real responsibility.

Criminals are quite capable of stealing sensitive information, manipulating training data to bias outcomes, or even disrupting entire generative AI (GenAI) systems. Securing data is therefore crucial to protecting privacy, maintaining model integrity and ensuring the responsible development of powerful AI applications.

Three tactics can help companies to automate data security at scale, since it's virtually impossible to do manually. Data classification detects, categorises and labels data, and its output feeds the next stage. Data protection defines policies − such as masking, tokenisation and encryption − that conceal the data. Finally, data security defines access control policies that determine who can access the data.

The three concepts work together as follows: first, privacy tiers should be defined and data tagged with a security designation of sensitive, confidential, or restricted. Next, a protection policy needs to be applied to mask restricted data. Finally, an access control policy must be implemented to limit access rights.
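
As a rough, hypothetical sketch of how these three layers might fit together − the tier names, columns, masking rule and roles below are invented for illustration, not drawn from any particular tool − the flow could look like this in Python:

    # Hypothetical sketch: classification tags, a masking rule and an access policy.
    # Tier names, columns and roles are invented for illustration only.

    def classify(column_name: str) -> str:
        """Step 1: classification - tag each column with a privacy tier."""
        if column_name in {"id_number", "card_number"}:
            return "restricted"
        if column_name in {"salary", "email"}:
            return "confidential"
        return "sensitive"

    def mask(value: str) -> str:
        """Step 2: protection - conceal restricted values with simple masking."""
        return value[:2] + "*" * (len(value) - 2)

    # Step 3: security - which roles may read which tier.
    ACCESS_POLICY = {
        "restricted": {"compliance"},
        "confidential": {"compliance", "analyst"},
        "sensitive": {"compliance", "analyst", "engineer"},
    }

    def read_value(role: str, column: str, value: str) -> str:
        tier = classify(column)
        if role not in ACCESS_POLICY[tier]:
            raise PermissionError(f"{role} may not read {tier} data")
        return mask(value) if tier == "restricted" else value

    print(read_value("analyst", "email", "jane@example.com"))      # allowed, returned as-is
    print(read_value("compliance", "id_number", "8001015009087"))  # allowed, but masked

In practice, each of these steps would be enforced by the data platform's own classification, masking and access control policies rather than by application code, but the ordering of the three layers stays the same.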


Next, data needs to be discoverable. AI-ready data must be readily findable and accessible within the system. Discoverable data unlocks the true potential of machine learning (ML) and GenAI, allowing these workloads to find the information they need to learn, adapt and produce groundbreaking results.

Good metadata practices drive discoverability. Beyond technical metadata, defining business metadata and semantic typing enhances both automated and human understanding. All metadata is then indexed and searchable via a data catalogue.
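
As a loose illustration − the dataset entries, metadata fields and keyword search below are invented, and a real data catalogue would offer far richer indexing − combining technical and business metadata for search might look like this:

    # Toy catalogue: each dataset carries technical metadata, business metadata
    # and semantic types; a keyword search stands in for a real catalogue's index.
    CATALOGUE = [
        {
            "name": "sales_transactions",
            "technical": {"format": "parquet", "columns": ["txn_id", "amount", "txn_date"]},
            "business": {"owner": "Finance", "description": "Point-of-sale transactions per store"},
            "semantic_types": {"amount": "currency", "txn_date": "date"},
        },
        {
            "name": "customer_profiles",
            "technical": {"format": "delta", "columns": ["customer_id", "email", "segment"]},
            "business": {"owner": "Marketing", "description": "Customer master data and segments"},
            "semantic_types": {"email": "email_address"},
        },
    ]

    def search(term: str) -> list:
        """Return the names of datasets whose metadata mentions the term."""
        term = term.lower()
        return [d["name"] for d in CATALOGUE if term in str(d).lower()]

    print(search("point-of-sale"))   # ['sales_transactions']
    print(search("marketing"))       # ['customer_profiles']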

Data must be easily consumable by ML or large language models (LLMs). AI initiatives won't be successful if the data is not in the right format for ML experiments or LLM applications.

The true potential of ML and GenAI applications rests on their ability to readily consume data. Unlike humans, who can decipher handwritten notes or navigate messy spreadsheets, these technologies require information to be represented in specific formats.

Making data easily consumable helps unlock the potential of these AI systems, allowing them to ingest information smoothly and translate it into intelligent actions for creative outputs.

Data transformation is regarded as the unsung hero of consumable data for ML. While algorithms like linear regression grab the spotlight, the quality and shape of the data they're trained on are just as critical.

Moreover, the effort invested in cleaning, organising and making data consumable by ML models reaps significant rewards. Prepared data empowers models to learn effectively, leading to accurate predictions, reliable outputs and, ultimately, the success of the entire ML project.
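
As a small, hedged example of such preparation − the toy records and column names below are invented − a typical transformation pass with pandas might look like this:

    # Hypothetical example: turning messy records into model-ready features.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    raw = pd.DataFrame({
        "region":  ["north", "south", None, "north"],
        "units":   [10, 15, 12, None],
        "revenue": [100.0, 160.0, 125.0, 90.0],
    })

    # Clean: fill the gaps; organise: one-hot encode the categorical column.
    clean = raw.assign(
        region=raw["region"].fillna("unknown"),
        units=raw["units"].fillna(raw["units"].median()),
    )
    features = pd.get_dummies(clean[["region", "units"]])

    # The linear regression that "grabs the spotlight" now trains on prepared data.
    model = LinearRegression().fit(features, clean["revenue"])
    print(model.predict(features.head(1)))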

However, training data formats depend heavily on the underlying ML infrastructure. Traditional ML systems are disk-based, and much of the data scientist workflow focuses on establishing best practices and manual coding procedures for handling large volumes of files.

More recently, lakehouse-based ML systems have used a database-like feature store, and the data scientist workflow has transitioned to SQL as a first-class language. As a result, well-formed, high-quality, tabular data structures are the most consumable and convenient data format for ML systems.
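
A rough sketch of that SQL-first workflow − using SQLite as a stand-in for a lakehouse feature store, with invented table and column names − might look like this:

    # Hypothetical sketch: SQLite stands in for a lakehouse-style feature store.
    import sqlite3
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer_features (customer_id INT, tenure_months REAL,
                                        monthly_spend REAL, churned INT);
        INSERT INTO customer_features VALUES
            (1, 24, 350.0, 0), (2, 3, 120.0, 1), (3, 36, 410.0, 0), (4, 6, 90.0, 1);
    """)

    # SQL as the first-class language: select a well-formed, tabular training set.
    df = pd.read_sql(
        "SELECT tenure_months, monthly_spend, churned FROM customer_features", conn)

    model = LogisticRegression().fit(df[["tenure_months", "monthly_spend"]], df["churned"])

    new_customer = pd.DataFrame([[12, 200.0]], columns=["tenure_months", "monthly_spend"])
    print(model.predict(new_customer))   # predicted churn for a new customer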

Making data consumable for GenAI

LLMs − like OpenAI's GPT-4, Anthropic's Claude and Google's LaMDA and Gemini − have been pre-trained on masses of text data and lie at the heart of GenAI.

OpenAI's GPT-3 model was estimated to be trained on approximately 45TB of data, exceeding 300 billion tokens. Despite this wealth of inputs, LLMs can't answer specific questions about your business because they don't have access to the company's data.

The solution is to augment these models with your company’s own information, resulting in more correct, relevant and trustworthy AI applications.

The method for integrating corporate data into an LLM-based application, in a safe and secure way, is called retrieval-augmented generation.

The technique generally uses text information derived from unstructured, file-based sources, such as presentations, mail archives, text documents, PDFs and transcripts. The text is then split into manageable chunks and converted into numerical representations for use by the LLM application, in a process known as embedding.
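
As a hedged sketch of that step − the source file name, chunk size and sentence-transformers model below are arbitrary choices made for illustration − chunking and embedding could look like this:

    # Illustrative chunk-and-embed sketch; file name, chunk size and model are arbitrary.
    from sentence_transformers import SentenceTransformer

    document = open("annual_report.txt", encoding="utf-8").read()  # hypothetical source file

    # Split the text into manageable, overlapping chunks.
    chunk_size, overlap = 500, 50
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size - overlap)]

    # Convert each chunk into a numerical representation (an embedding).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks)    # one vector per chunk
    print(embeddings.shape)              # (number_of_chunks, embedding_dimension)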

These embeddings are then stored in a vector database, such as Chroma, Pinecone or Weaviate. Interestingly, many traditional databases − such as PostgreSQL, Redis and SingleStoreDB − also support vectors. Moreover, cloud platforms like Databricks, Snowflake and Google BigQuery have recently added vector support, too.
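
Continuing that sketch in the simplest possible way − a plain in-memory similarity search stands in for what Chroma, Pinecone, Weaviate or a vector-enabled relational database would do at scale, and the chunks and question are invented − retrieval for an LLM prompt might look like this:

    # In-memory stand-in for a vector database: store chunk embeddings, then fetch
    # the chunks most similar to a question before building the LLM prompt.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["Revenue grew 12% in the retail division.",          # illustrative chunks
              "Head office relocated to Johannesburg in 2024.",
              "The loyalty programme now has two million members."]
    index = model.encode(chunks, normalize_embeddings=True)        # the toy "vector store"

    question = "How fast did retail revenue grow?"
    query = model.encode([question], normalize_embeddings=True)[0]

    scores = index @ query                       # cosine similarity via dot product
    top = np.argsort(scores)[::-1][:2]           # the two most relevant chunks
    context = "\n".join(chunks[i] for i in top)
    print(context)                               # pasted into the LLM prompt as context

In a production system, this retrieval step is handled by the vector database itself, which is precisely what makes these stores so convenient for retrieval-augmented generation.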

In conclusion, despite the transformative power of ML, plus GenAI's explosive growth potential, data readiness remains the cornerstone of any successful AI implementation.

The key principles I have discussed for establishing a robust and trusted data foundation combine to help your organisation unlock the true potential of AI.
