
Overcoming test data hurdles with realistic synthetic data

Quality datasets are crucial for AI training, but the need to protect real-world data can slow development and implementation.
By Bryn Davies, CEO, InfoBluePrint.
Johannesburg, 18 Jun 2025

As large enterprises step up application development and need to train more artificial intelligence (AI) models more rapidly, software and AI developers have encountered a significant hurdle in the process − the data needed to test applications and train AI.

To overcome the shortage of realistic test data, synthetic data is gaining traction: it offers a practical solution both to the challenge of accessing quality test data and to data privacy concerns.

Traditional test data downsides

A key issue for most organisations is the privacy of customer information. Compliance with the Protection of Personal Information Act (POPIA), and the need to secure the personally identifiable information (PII) held in customer databases, mean this data should not be used outside of highly secured production environments for testing new applications or training AI.

But IT departments and developers need test data for new, enhanced or changed applications, and the most representative test data is real data. Traditionally, developers would take an extract from the customer database and load it into their development environment to run tests against. However, development environments are not inherently as secure as production environments, raising concerns about security, data misuse and compliance.

Synthetic datasets can now be created that accurately reproduce the statistical patterns of genuine datasets.

For years, the alternative has been some form of artificial test data. However, traditional ‘dumb’ test data is often either randomly scrambled real data or fabricated data that does not represent the distribution and properties of actual data.

This test data has several downsides: it can be costly to create and scale, and because it does not reflect the real world, it is not ideal for testing.

Synthetic data for realistic test environments

The emergence of modern, safe and realistic synthetic data is helping organisations overcome these challenges.

Thanks to more sophisticated generation tools, synthetic datasets can now be created that accurately reproduce the statistical patterns of genuine datasets, and they can be created and scaled more rapidly and cost-effectively than traditional test datasets.

Importantly, they are also less likely than genuine datasets to be biased or flawed. A model is only as good as the data it is trained on, and randomised traditional test data may introduce bias and misrepresentations that degrade an AI model. Synthetic data delivers much more realistic test data, aligned to real-world situations.

Synthetic data in action

Synthetic data tools can generate realistic datasets where no production data exists, simulate a database based on patterns in existing databases, or automatically mask PII, including PII that occurs in unstructured or semi-structured formats such as documents, e-mails and XML files.
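
To make the masking idea concrete, here is a minimal Python sketch, not based on any particular product, that redacts e-mail addresses and 13-digit South African ID numbers from free text using regular expressions. Real tools use far more sophisticated, often AI-based, detection; these patterns are illustrative assumptions only.

```python
import re

# Illustrative patterns only; production tools use far more
# sophisticated, often ML-based, PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SA_ID = re.compile(r"\b\d{13}\b")  # South African ID numbers are 13 digits


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and 13-digit ID numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SA_ID.sub("[ID NUMBER]", text)


print(mask_pii("Contact jane.doe@example.com, ID 8001015009087."))
# -> Contact [EMAIL], ID [ID NUMBER].
```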

The difference between traditional test data and modern synthetic data is that synthetic data solutions typically use AI to profile production data and then reproduce realistic fake data that maintains the statistical properties of the real data. As a very simple example, if the customer database has 60% male and 40% female customers, the generated test data will preserve that distribution.
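
As a hypothetical illustration of that profile-and-generate approach, the Python sketch below measures the gender split in a made-up production extract and samples synthetic records with the same proportions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical production extract: 60% male, 40% female customers.
production = pd.DataFrame({"gender": ["M"] * 60 + ["F"] * 40})

# Profile step: measure the real distribution.
proportions = production["gender"].value_counts(normalize=True)

# Generate step: sample fake records that preserve those proportions.
synthetic = pd.DataFrame(
    {"gender": rng.choice(proportions.index, size=1_000, p=proportions.values)}
)

print(synthetic["gender"].value_counts(normalize=True))  # roughly 0.6 / 0.4
```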

Such a solution goes further and preserves relationships between different elements in the data. For example, an organisation might have gold, silver and blue level customers, with each level corresponding to typical transaction amounts or levels of interaction. Normal ‘dumb’ test data does not represent that sort of correlation, but synthetic data will automatically pick up the relationship and reproduce it in the fake data it generates, as the sketch below shows.
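
Here is one simple way such a relationship can be preserved, again a hypothetical sketch rather than any vendor's actual method: sample the customer's tier first, then draw a transaction amount from statistics learned per tier, so the tier-to-spend correlation carries over into the fake data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Hypothetical production data in which customer tier drives spend.
tier_means = {"gold": 5_000.0, "silver": 1_500.0, "blue": 300.0}
production = pd.DataFrame(
    {"tier": rng.choice(list(tier_means), size=500, p=[0.2, 0.3, 0.5])}
)
production["amount"] = [
    rng.normal(tier_means[t], tier_means[t] * 0.1) for t in production["tier"]
]

# Profile step: learn amount statistics per tier, not one global average.
stats = production.groupby("tier")["amount"].agg(["mean", "std"])

# Generate step: pick a tier, then draw an amount conditioned on that tier,
# so gold customers in the fake data still transact at gold-like levels.
tiers = rng.choice(stats.index, size=1_000)
amounts = [rng.normal(stats.loc[t, "mean"], stats.loc[t, "std"]) for t in tiers]
synthetic = pd.DataFrame({"tier": tiers, "amount": amounts})

print(synthetic.groupby("tier")["amount"].mean().round(0))
```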

Adoption is accelerating and a growing number of South African organisations are looking to synthetic data tools to help them fast-track development and model training. Underlining this trend, Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI training models.

With realistic synthetic data, organisations unblock access to test and training data and address data privacy concerns, speeding up development with greater assurance that applications and AI models will perform as expected.

In future, we can expect to see synthetic data used for more complex use cases across integrated value chains in sectors as varied as financial services, healthcare, autonomous vehicles, manufacturing, retail and marketing.
