Multilingual language models are built to address a technical challenge facing online services: there is not enough digitized text in most of the world’s 7,000+ languages to train AI systems. Researchers claim that, by scanning huge volumes of text in dozens or even hundreds of languages, multilingual language models can learn general linguistic rules that help them understand any language.
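To make this design concrete, here is a minimal sketch, assuming the open-source Hugging Face `transformers` library and the public `xlm-roberta-base` checkpoint (a multilingual model trained on text in roughly 100 languages); the example sentences are our own, chosen for illustration. A single shared model fills in a masked word in several languages without being told which language it is reading:

```python
# A minimal illustration of cross-lingual modeling, assuming the
# Hugging Face `transformers` library and the public multilingual
# checkpoint `xlm-roberta-base` (trained on ~100 languages).
from transformers import pipeline

fill = pipeline("fill-mask", model="xlm-roberta-base")
mask = fill.tokenizer.mask_token  # "<mask>" for XLM-RoBERTa

# The same model completes analogous sentences in different
# languages, with no language identifier passed in.
for sentence in [
    f"The capital of France is {mask}.",      # English
    f"La capitale de la France est {mask}.",  # French
    f"Jiji kuu la Ufaransa ni {mask}.",       # Swahili
]:
    top = fill(sentence)[0]  # highest-probability completion
    print(f"{sentence} -> {top['token_str']!r} (p={top['score']:.2f})")
```

In practice, the model’s confidence and accuracy tend to be noticeably higher for high-resource languages like English and French than for languages with less training data, which is exactly the disparity discussed below.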
While multilingual language models have produced impressive results on basic tasks, like parsing grammar, in research environments, companies are increasingly using them in the real world for more language- and context-specific tasks, and we have reason to believe they do not perform as well in that setting.
To fully understand the effects of these models, we need to know more about how and where companies are using them. Training and testing a model on only a small fraction of text in certain languages can leave it unable to understand those languages, creating real barriers to information access and outsized negative impacts on people’s lives and safety.
In our paper, which examines how these models work, we identified several specific shortcomings:
- Multilingual models are built predominantly on English-language data. They thereby encode English-language values and assumptions and import them into the analysis and generation of text in other languages, overlooking local context and limiting accuracy.
- Multilingual language models are often trained and tested on machine-translated text, which can contain errors or terms that native speakers don’t use in practice (see the sketch after this list).
- When multilingual language models fail, their problems are hard to identify, diagnose, and fix.
- The more languages a multilingual language model trains on, the less it captures the idiosyncrasies of each one, a trade-off researchers call the “curse of multilinguality.” Languages interfere with one another, meaning that developers must balance teaching models more languages against improving how well they work in each one.
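The machine-translation problem noted above can be seen in a short, hedged sketch, again using `transformers`, this time with the public Helsinki-NLP OPUS-MT checkpoints (our choice of models and example text, for illustration only). Round-tripping idiomatic English through machine translation surfaces the kind of unnatural phrasing that can end up in translated training and test sets:

```python
# A sketch of why machine-translated benchmarks can mislead, assuming
# the public Helsinki-NLP OPUS-MT models available via `transformers`.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

original = "That movie was sick! It totally blew my mind."
french = en_to_fr(original)[0]["translation_text"]
round_trip = fr_to_en(french)[0]["translation_text"]

print("original:  ", original)
print("translated:", french)
print("round trip:", round_trip)
# Slang like "sick" is often rendered literally, producing text no
# native speaker would write -- yet benchmarks built from such
# translations are used to score models in that language.
```

A model trained or evaluated on text like the middle line learns a dialect of translationese rather than the language people actually write.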
Companies like Google, Meta, and Bumble are already using these tools to detect and even take action on problematic content. Others may soon use them to power automated tools that scan resumes or immigration applications.
To improve multilingual language models and hold them accountable, companies need to reveal more about the data used to train them, funders need to invest in the growing communities documenting and building natural language processing models in different languages, and governments need to avoid using these models in ways that may threaten civil liberties.