NEW FROM CDT
CDT Research Identifies Key Shortcomings of Large Language Models Used to Analyze Non-English Languages
Today, the Center for Democracy & Technology released a new report examining the capabilities and limitations of multilingual language models ([link removed]), a machine learning technology that companies are using to analyze and generate non-English language content.
Multilingual language models are built to address a technical challenge facing online services: there is not enough digitized text in most of the world’s 7,000+ languages to train AI systems. Researchers claim that, by scanning huge volumes of text in dozens or even hundreds of languages, multilingual language models can learn general linguistic rules that help them understand any language.
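To make that cross-lingual claim concrete (this illustration is ours, not drawn from the report), a single multilingual model can fill in a masked word across several languages with one set of weights. A minimal Python sketch, assuming the Hugging Face transformers library and the illustrative xlm-roberta-base model:

    # Minimal sketch; assumes the "transformers" library is installed.
    # The model choice is illustrative, not one evaluated in CDT's report.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

    # Same weights, three languages; XLM-R's mask token is "<mask>".
    for sentence in [
        "The capital of France is <mask>.",           # English
        "La capitale de la France est <mask>.",       # French
        "Die Hauptstadt von Frankreich ist <mask>.",  # German
    ]:
        top = fill_mask(sentence)[0]
        print(sentence, "->", top["token_str"], round(top["score"], 3))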
While multilingual language models have achieved impressive results on basic tasks, such as parsing grammar, in research settings, companies often deploy them in the real world for more language- and context-specific tasks, where there is reason to believe they perform less well.
To fully understand the effects of these models, we need to know more about how and where companies are using them. Training and testing a model on only a small fraction of text in a given language can leave it unable to understand that language, creating real barriers to information access and outsized negative impacts on individuals’ lives and safety.
In our paper, which examines how these models work, we identified several specific shortcomings:
Multilingual language models are built predominantly on English-language data. They therefore encode English-language values and assumptions and import them into the analysis and generation of text in other languages, overlooking local context and limiting accuracy (one measurable symptom is shown in the tokenizer sketch after this list).
Multilingual language models are often trained and tested on machine-translated text, which can contain errors or terms that native speakers don’t actually use.
When multilingual language models fail, their problems are hard to identify, diagnose, and fix.
The more languages a multilingual language model is trained on, the less it captures the idiosyncrasies of each one. Languages interfere with one another, so developers must trade off teaching a model more languages against improving how well it works in each one.
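One way to observe the English-language skew in practice (again our illustration, assuming the Hugging Face transformers library): subword tokenizers trained mostly on English text tend to split under-represented languages into far more pieces, leaving the model less capacity to represent their meaning.

    # Minimal sketch; assumes the "transformers" library is installed.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

    # Rough translations of the same sentence; the gap in piece counts
    # is the point, not the exact numbers.
    samples = {
        "English": "Good morning, my friend.",
        "Swahili": "Habari za asubuhi, rafiki yangu.",
    }
    for lang, text in samples.items():
        print(lang, "->", len(tok.tokenize(text)), "subword pieces")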
Companies like Google ([link removed]), Meta ([link removed]), and Bumble ([link removed]) are already using these tools to detect and even take action on problematic content. Others may soon use them to power automated tools that scan resumes or immigration applications.
To improve multilingual language models and hold them accountable, companies need to reveal more about the data used to train them, funders need to invest in the growing communities documenting and building natural language processing models in different languages, and governments need to avoid using these models in ways that may threaten civil liberties.
Read the full report on CDT’s website ([link removed]), and RSVP to join us tomorrow to discuss the paper at an event called “Mind the Gap.” ([link removed])
CONNECT WITH CDT
SUPPORT OUR WORK ([link removed])
1401 K St NW Suite 200 | Washington, DC xxxxxx United States