Why SA needs locally representative datasets to train diagnostic AI

Accurate, fair diagnostic AI can help clinicians make faster, better-informed decisions and reduce unnecessary costs.
By Derryn Bentley, Head of new business development, Info.Blueprint.
Johannesburg, 02 Dec 2025
Derryn Bentley, head of new business development at Info.Blueprint.

Diagnostic artificial intelligence (AI) is transforming global healthcare. Diagnostic AI systems are trained to identify, classify or predict conditions based on data such as imaging, pathology results and clinical records. By analysing vast and complex datasets, diagnostic AI helps clinicians reach conclusions faster and with greater accuracy than would otherwise be possible.

Worldwide, AI systems are being deployed to interpret scans, detect cancers, predict heart disease and flag patients at risk of deterioration. In South Africa, such tools could be game-changing, particularly in resource-constrained public hospitals and clinics, where staff shortages persist and case loads remain high.

Used well, diagnostic AI can support faster decision-making, improve outcomes and ensure scarce medical resources are directed where they are most needed.

But diagnostic models are only as good as the data they learn from. If the data does not reflect South Africa’s population, health conditions and environmental context, the technology risks making poor or inequitable decisions − a serious concern in a country already grappling with deep healthcare inequalities.

In South Africa, the private healthcare sector has led the way in adopting diagnostic AI. Medical insurers, hospital networks and healthcare providers are investing in machine learning tools to improve operational efficiency, lower costs and enhance patient experience. These initiatives aim to reduce unnecessary procedures and enable quicker, more accurate diagnoses.

By contrast, adoption in the public sector remains limited. Fragmented health records, legacy paper systems and infrastructure constraints have slowed progress.

Nonetheless, the Western Cape’s Provincial Health Data Centre (PHDC) stands as a strong example of what can be achieved. By integrating person-level data across multiple systems under robust privacy governance, PHDC demonstrates the value of building reliable, interoperable data ecosystems.

Yet most diagnostic AI models currently available are still trained on international datasets. These tend to reflect populations in Europe, North America, or parts of Asia − places with different disease patterns, lifestyles and genetic factors. As a result, AI tools that perform well abroad may not generalise effectively to South African settings.

Risks of using non-local data

South Africa faces a unique mix of public and social health challenges. High rates of HIV, tuberculosis (TB) and non-communicable diseases such as diabetes, hypertension and heart disease combine to create complex comorbidity patterns. The way these conditions interact in patients is often very different from patterns seen elsewhere. All of this needs to be seen in the context of South Africa’s high levels of poverty and malnutrition, constrained access to healthcare and associated factors.

When AI models are trained on international datasets that do not reflect these realities, their outputs can be dangerously inaccurate.

A diagnostic tool optimised for European patients, for example, might misinterpret lab results from someone managing both HIV and TB. The result could be delayed diagnosis, inappropriate treatment or misdirected care. Context is king in using and interpreting this data, something that is missing in international datasets.

The consequences extend beyond clinical accuracy. In a country where many people cannot afford extended sick leave, misdiagnosis can mean lost income, unnecessary costs and prolonged illness.

Moreover, urban data rarely captures the realities of rural healthcare, where access to services and disease prevalence differ sharply. Without locally representative data, AI systems risk deepening rather than reducing inequality.

The National Health Laboratory Service and National Institute for Communicable Diseases maintain valuable datasets, but these are not easily available. Expanding and diversifying available data − safely and ethically − is therefore essential.

Synthetic data: A practical solution

One emerging approach is the use of synthetic data – artificially generated information that mirrors the patterns and structure of real-world datasets but contains no actual personal details.

Advanced machine learning models such as generative adversarial networks and variational autoencoders are trained to learn the complex relationships within real data, then create new, artificial records that preserve these relationships without revealing anyone’s identity.

In essence, synthetic data behaves like the real thing but protects privacy by design. This approach aligns neatly with South Africa’s Protection of Personal Information Act, which sets stringent requirements for processing sensitive health data. Because synthetic data carries no identifiable patient information, it significantly reduces the risk of breaches or non-compliance.
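The idea of learning relationships from real records and then generating new ones can be illustrated with a deliberately simple sketch. It does not use a generative adversarial network or variational autoencoder; as a stand-in, it fits a two-variable Gaussian to invented "real" records (a hypothetical age and lab value) and samples fresh records from it. All names and numbers are assumptions made for illustration only.

```python
import math
import random
import statistics

def fit_gaussian(records):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in records) / (len(records) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, rng):
    """Draw n new (x, y) records that follow the fitted means, spreads and correlation."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        out.append((x, y))
    return out

rng = random.Random(0)

# Invented stand-in for real patient records: (age, lab value), with the lab value tied to age.
real = []
for _ in range(500):
    age = rng.uniform(20, 70)
    real.append((age, 0.5 * age + rng.gauss(0, 5)))

model = fit_gaussian(real)
synthetic = sample_synthetic(model, 5000, rng)
synthetic_model = fit_gaussian(synthetic)

print("correlation in real data:     ", round(model["rho"], 2))
print("correlation in synthetic data:", round(synthetic_model["rho"], 2))
```

The synthetic records keep the age-to-lab-value relationship of the source data while none of them corresponds to an individual source record. Production tools model far more variables and non-linear relationships, which is where the more advanced techniques come in.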

Synthetic data can address several of South Africa’s most pressing diagnostic challenges:

Augmenting limited datasets: For rare or under-recorded diseases, including tropical infections or co-morbid conditions such as HIV-related cancers, synthetic data can expand the dataset to improve AI model accuracy.

Enabling safe collaboration: Synthetic datasets can be shared across institutions − public and private hospitals, universities and start-ups − without breaching privacy laws, enabling innovation at lower risk.

Supporting digital readiness: Facilities that still rely on paper records can use synthetic data to test and train new electronic health systems before real patient data is available.

The approach also supports the South African Health Products Regulatory Authority in its oversight of AI-enabled medical devices. By using synthetic data for model validation and stress-testing, developers can prove algorithm robustness and fairness without exposing real patient information.

Challenges and guardrails

Despite its potential, synthetic data is not a silver bullet. Because it reflects the statistical properties of the original data, any existing bias can be reproduced or even amplified. If certain groups − rural patients, women, or specific age brackets − are underrepresented in the source data, the resulting synthetic dataset will likely carry the same imbalance.
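A minimal sketch shows how this carry-over happens. It assumes an invented source dataset in which only 10% of records come from rural settings, and a hypothetical generator that simply learns and reproduces the group shares it sees. The field name `setting` and all figures are illustrative assumptions, not real statistics.

```python
import random
from collections import Counter

def fit_group_shares(records):
    """Learn the share of each setting (e.g. urban or rural) in the source data."""
    counts = Counter(r["setting"] for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def sample_settings(shares, n, rng):
    """Generate n synthetic settings in proportion to the learned shares."""
    groups = list(shares)
    weights = [shares[g] for g in groups]
    return rng.choices(groups, weights=weights, k=n)

# Invented source data: rural patients are heavily underrepresented.
source = [{"setting": "urban"}] * 900 + [{"setting": "rural"}] * 100

shares = fit_group_shares(source)
rng = random.Random(42)
synthetic = sample_settings(shares, 10_000, rng)
synthetic_rural_share = synthetic.count("rural") / len(synthetic)

print("rural share in source:   ", shares["rural"])
print("rural share in synthetic:", round(synthetic_rural_share, 3))
```

Generating more synthetic records does not fix the imbalance; the rural share stays near 10% however large the output. Correcting it requires a deliberate choice, such as reweighting the groups or collecting more source data from the underrepresented setting.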

This makes transparency and governance vital. Organisations must document how synthetic data is created, what source data it draws from, and how bias is being mitigated. Diagnostic AI systems trained on synthetic data must still undergo rigorous clinical validation using real patient datasets to ensure accuracy, realism and safety.
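One way such validation can surface hidden problems is by reporting results per patient group rather than as a single overall figure. The sketch below assumes a hypothetical hold-out set of real cases, each recorded as a group, whether the condition was present, and whether the model flagged it. The data is invented purely to show the calculation.

```python
from collections import defaultdict

def sensitivity_by_group(cases):
    """Return, per group, the share of true cases that the model flagged.

    cases: iterable of (group, has_condition, model_flagged) tuples.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for group, has_condition, flagged in cases:
        if has_condition:
            total[group] += 1
            if flagged:
                found[group] += 1
    return {group: found[group] / total[group] for group in total}

# Hypothetical hold-out results, invented for illustration only.
holdout = [
    ("urban", True, True), ("urban", True, True),
    ("urban", True, True), ("urban", True, False),
    ("urban", False, False),
    ("rural", True, True), ("rural", True, False),
    ("rural", True, False), ("rural", True, False),
    ("rural", False, False),
]

report = sensitivity_by_group(holdout)
for group, value in report.items():
    print(f"{group}: {value:.0%} of true cases detected")
```

In this made-up example the model detects half of all true cases overall, a figure that conceals the fact that it misses most of the rural ones. A per-group breakdown makes that gap visible before deployment.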

Infrastructure and skills gaps also remain a barrier. Many public health facilities still lack the computing power or data science capacity to implement such technologies effectively. Investment in local expertise, data infrastructure and ethical oversight is therefore essential to ensure responsible deployment.

South Africa’s opportunity

SA’s robust data protection framework, advanced academic community and growing digital health networks place it in a strong position to lead the responsible use of synthetic data.

With its diverse, multilingual population and high burden of both communicable and non-communicable diseases, the country offers an invaluable testing ground for developing diagnostic AI that is both equitable and effective.

Initiatives like the PHDC and the South African Population Research Infrastructure Network already provide a foundation for integrated data systems. Synthetic data could accelerate their value − enabling faster research collaboration, reducing time-to-deployment for new models, and building AI tools tuned to SA’s unique clinical realities.

Ultimately, locally representative datasets are not just a technical requirement; they are a moral and economic imperative. Accurate, fair diagnostic AI can help clinicians make faster, better-informed decisions, reduce unnecessary costs, and improve patient outcomes across the public and private sectors alike.

For SA, the path forward lies in combining real and synthetic data to reflect its full spectrum of health diversity. The gap between private and public healthcare data needs to be narrowed − while private healthcare data is often of higher quality on a number of metrics, it isn't fully representative.

Both of these need to be addressed to ensure the future of diagnostic AI is built not just on technology, but on trust, inclusion and local relevance.
