The GSM Association (GSMA) and Pleias have introduced CommonLingua, an open-source language identification (LID) model.
Pleias is a research lab and artificial intelligence (AI) company that specialises in open, auditable language models trained on permissively licensed data.
Released under the GSMA’s African AI languages model project, CommonLingua is a two-million-parameter open-source model covering 334 languages, including 61 African languages, according to a statement.
The industry body explains that leading LID systems, such as fastText, GlotLID and OpenLID, were built around European and Asian languages.
As a result, African-language text is “frequently mislabelled” as English or French, it says. Even state-of-the-art frontier models drop roughly 30 points in accuracy on African languages compared to major world languages.
CommonLingua is designed to fix this step of the pipeline, unlocking African language data at scale. The model achieves 83% accuracy, claims the GSMA.
It covers 61 African languages, which the statement breaks down into seven groupings: Bantu (21), Niger-Congo / West African (18), Afro-Asiatic and Semitic (7), Cushitic and Chadic (4), Berber (3), Nilo-Saharan (3), and pidgins, creoles and other (5).
The model operates directly on UTF-8 byte sequences rather than relying on a language-specific tokeniser, enabling consistent handling across scripts, including Latin, Arabic, Ethiopic, N’Ko and Tifinagh.
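To illustrate what working at the byte level means, here is a minimal Python sketch. It is not CommonLingua's actual preprocessing code, which the statement does not describe; the helper name `to_byte_ids` and the sample greetings are assumptions made for illustration. It only shows that text in any script reduces to the same small alphabet of byte values (0 to 255), so no per-language vocabulary is needed.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte.

    Hypothetical helper for illustration; not part of any CommonLingua API.
    """
    return list(text.encode("utf-8"))


# Short sample strings in different scripts (illustrative only).
samples = {
    "Latin": "Habari",
    "Arabic": "سلام",
    "Ethiopic": "ሰላም",
    "Tifinagh": "ⴰⵣⵓⵍ",
}

for script, text in samples.items():
    ids = to_byte_ids(text)
    # Every script yields values from the same 0-255 range; only the
    # number of bytes per character differs.
    print(f"{script}: {len(text)} characters -> {len(ids)} bytes, first bytes {ids[:6]}")
```

The trade-off of this kind of input is longer sequences for non-Latin scripts, since a single character can take two or three bytes, in exchange for one fixed input alphabet shared by every language.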
“African languages are not an edge case. They are the working languages of hundreds of millions of people, and they deserve AI infrastructure built with the same care as any other language,” says Pierre-Carl Langlais, co-founder and CTO at Pleias.
“CommonLingua is deliberately the first brick we are laying: you cannot curate what you cannot identify.”
The African continent is a multilingual melting pot, with around 2 000 to 3 000 distinct languages. Nigeria alone has more than 500 languages. South Africa has 11 official spoken languages, but only one in 10 South Africans speaks English at home – the language that dominates the internet.
According to the GSMA, the model is trained exclusively on open-licensed and public domain content aggregated through the Common Corpus project, including Wikipedia, scientific publications in OpenAlex, VOA Africa, WaxalNLP, Cultural Heritage and Pralekha. All datasets are released under permissive licences.
Louis Powell, director of AI initiatives at GSMA, comments: “Closing the gap in African-language AI is fundamental to digital inclusion and unlocking economic opportunity. Progress has long been held back by the lack of foundational infrastructure, beginning with something as essential as language identification.
“CommonLingua addresses this critical gap, enabling the development of richer datasets and more representative AI systems at scale.”

