Researchers at the University of Cape Town (UCT) have developed a multilingual artificial intelligence (AI) language model trained on all 11 of South Africa’s official written languages.
The AI model aims to address gaps in how global AI systems serve local users.
The research, led by Anri Lombard, a master’s researcher in computer science at UCT; Dr Jan Buys, a senior lecturer in the Department of Computer Science at UCT; and Dr Francois Meyer, a lecturer in the same department, will be presented at the Language Resources and Evaluation Conference in Mallorca, Spain, this month.
The research introduces two components: MzansiText, a curated multilingual dataset covering all 11 of SA’s official written languages, and MzansiLM, a language model trained from scratch on that dataset.
The release comes as AI-powered language tools increasingly shape access to information and digital services globally, while performance remains inconsistent for many South African languages.
According to the researchers, queries in languages such as isiNdebele or Sepedi often produce inaccurate or incomplete responses in mainstream AI systems, largely due to limited training data.
“In language modelling, these languages are considered low-resource, primarily because far fewer and smaller textual datasets are available in them for training language models,” says Buys.
“Our dataset, MzansiText, is still small compared to data available for high-resource languages such as English and major European and Asian languages, but larger than previous datasets for South African languages.”
Nine of SA’s 11 official written languages fall into the low-resource category. While isiZulu and isiXhosa have seen some development, others, including isiNdebele and Sepedi, remain underrepresented.
According to the researchers, MzansiLM is the first publicly available decoder-only language model designed to support all 11 official written languages in a single system.
“There has been real progress in language modelling for African languages, including some South African ones like isiXhosa and isiZulu,” adds Meyer. “But most existing models only cover a subset of languages. With MzansiLM, we wanted to build a single model focused specifically on South Africa that covers all 11 official written languages, including those that are often left out.”
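For context, a decoder-only model of this kind is typically used by giving it a prompt and letting it continue the text. The sketch below, using the Hugging Face transformers library, shows what that might look like; the hub identifier is a hypothetical placeholder for illustration, not a confirmed location of the released weights.

```python
# Minimal sketch of prompting a small decoder-only model with Hugging Face
# transformers. The model identifier below is hypothetical; consult the
# paper or the UCT release for the actual location of the MzansiLM weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "uct-nlp/mzansilm-125m"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# A decoder-only model continues a prompt from left to right, so raw usage
# is text completion rather than chat-style question answering.
prompt = "South Africa has eleven official written languages"  # or a prompt in any of the 11
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```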
For Lombard, the project grew out of his master’s research.
“I came into this work through my master’s research, which looks at how different language-model architectures perform for low-resource languages, since that is still a relatively underexplored area,” he states.
“One thing that stood out to me is that publicly available models tended to cover only a subset of the South African languages we care about. MzansiLM was meant to provide a small decoder-only baseline that future work can compare against and build on.”
With 125 million parameters, MzansiLM is smaller than most commercial AI systems. Despite this, it performed strongly in targeted benchmarks, outperforming larger open-source models in several South African languages, he adds.
The researchers emphasise that MzansiLM is not a consumer-facing chatbot but a foundational model intended for further development.
“In practice, that means developers could build tools for specific use cases, such as summarising information or annotating raw data, in South African languages,” Meyer states.
“Adapting MzansiLM for a limited use case might be more effective and affordable than relying on proprietary large language models, if you want users to be able to interact with a system in their home language.”
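As an illustration of what such an adaptation could look like, the sketch below fine-tunes a small decoder-only checkpoint on task-specific text using the Hugging Face Trainer. The model identifier, the data file and the document-plus-summary line format are all assumptions made for illustration, not details from the UCT release.

```python
# Hypothetical sketch of adapting a small decoder-only model to one narrow
# task (summarisation framed as text completion). Model ID and data file
# are placeholders, not part of the UCT release.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "uct-nlp/mzansilm-125m"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # small decoder models often lack a pad token

# Each training line pairs a document with its summary in the same language;
# the exact line format is an assumption for illustration.
dataset = load_dataset("text", data_files={"train": "summaries.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mzansilm-summariser",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=tokenized["train"],
    # Causal LM objective: labels are the input tokens shifted by one.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The appeal of this route, as Buys and Meyer describe it, is that a 125-million-parameter model is cheap enough to fine-tune and host that a narrowly scoped local-language tool could be viable where calling a proprietary large model is not.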
The researchers note that broader benefits will depend on future iterations and on applications built on top of the model. They add that the findings reflect a wider issue in AI development: performance gaps persist outside dominant languages such as English.
“Our findings show that the model can work well when fine-tuned for specific tasks but is not yet able to work well for general-purpose user interaction or instruction following, due to the limited training data,” Buys comments.
“This helps to explain why even larger language models don’t yet work as well when used in languages other than English.”
The team says continued collaboration within the research community will be critical to improving outcomes for African languages.
UCT has made MzansiText and MzansiLM publicly available to support further research and development. The paper, “MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages”, is available on arXiv.

