How far have we, together, come to provide equal language technology performance around the world?
Natural Language Processing (NLP) applications are now ubiquitous and used by millions of individuals worldwide on a daily basis. Nevertheless, these applications can be brittle and biased. For example, the accuracy of syntactic parsing models has been observed to drop by at least 20 percent on African-American Vernacular English compared to textbook-like English (as it is commonly spoken by the more privileged class of Americans). Further, sentiment analyzers fail on language originating from different time periods, question-answering systems fail on British English, conversational assistants struggle to interact with millions of elderly people with speech disabilities, and hate speech detection systems are biased and more likely to incorrectly classify language from specific demographics as offensive. In short, NLP models and applications work well only for a minority of the population, effectively excluding a significant majority that uses such applications just as often. Roughly 6,500 languages are spoken in the world today, yet advances in NLP in academia and industry focus on a minuscule subset of them.
There has been a growing awareness of intrinsic biases in NLP technologies and recent research efforts have been driven towards building novel metrics and tests to evaluate these biases.
Given the rapid rate of technology adoption globally, there is a pressing need for measuring and understanding NLP performance inequalities across the world’s languages. In this blog post, we summarize two recent publications that address this matter:
Blasi, D., Anastasopoulos, A., & Neubig, G. (2021). Systematic Inequalities in Language Technology Performance across the World’s Languages. arXiv preprint arXiv:2110.06733. By researchers at Harvard, George Mason University, and CMU.
Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095. By researchers at Microsoft Research India.
First, we present a novel estimate of how the utility (defined in detail below) afforded by NLP systems is distributed across individuals, languages, and tasks at an unprecedented global scale. These estimates help identify which languages are systematically underserved by language technologies and where focused technology development could benefit the most individuals.
Sounds interesting, right?
The fundamental goal of the first paper (Blasi et al., 2021) is to evaluate how a diverse, representative set of language technologies (and their quality) is distributed across the world’s languages and their populations. They ask a vital question:
Can we derive a pattern between the demand for language technologies and the utility they provide to users across languages?
Figure 1: Formulation of the global normalized metric.
Figure 1 shows the formulation of the global normalized metric, which combines two terms, utility and demand, defined as follows:
The utility of a given language technology on a given task and language can be defined as its performance normalized by the best possible performance afforded by such a task in any language.
The demand for a given language technology is controlled by an exponent τ. With τ=1, demand is demographic: the need for a technology is proportional to the number of speakers of the language. With τ=0, demand is linguistic: it is equal across all languages.
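One way to write a demand term with exactly these two limiting cases is sketched below. The symbols are our own assumptions for illustration: n_l is the number of speakers of language l, and L is the set of languages under consideration.

```latex
d_l^{(\tau)} = \frac{n_l^{\tau}}{\sum_{l' \in L} n_{l'}^{\tau}}
```

With τ=1 each language is weighted by its share of speakers, and with τ=0 every language receives the same weight 1/|L|. Values of τ between 0 and 1 interpolate between the two views.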
The final metric, M(τ), is bounded between 0 and 1. Here, 0 corresponds to a case where no one benefits from a given language technology, whereas 1 corresponds to a situation where all languages enjoy perfect technology.
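To make the metric concrete, here is a minimal Python sketch. It assumes that M(τ) is the demand-weighted sum of per-language utilities, with demand proportional to the number of speakers raised to the power τ. The function names and all numbers are hypothetical, invented purely for illustration, and are not taken from the paper.

```python
def demand(speakers, tau):
    """Normalized demand per language: n_l**tau divided by the sum over all languages."""
    weights = {lang: n ** tau for lang, n in speakers.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


def global_utility(performance, best_possible, speakers, tau):
    """M(tau): demand-weighted sum of utility (performance / best possible performance)."""
    d = demand(speakers, tau)
    # A language with no evaluated system is assumed to have utility 0.
    return sum(d[lang] * performance.get(lang, 0.0) / best_possible for lang in speakers)


# Toy data (hypothetical speaker counts and task scores, not real measurements).
speakers = {"eng": 1000.0, "hin": 600.0, "yor": 40.0, "sme": 1.0}
performance = {"eng": 90.0, "hin": 60.0}

m_demographic = global_utility(performance, 100.0, speakers, tau=1)
m_linguistic = global_utility(performance, 100.0, speakers, tau=0)
print(m_demographic, m_linguistic)
```

On this toy data the demographic score is roughly 0.77 while the linguistic score is 0.375: serving two populous languages well already covers most speakers, but only half of the languages.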
For which major NLP tasks can we apply our metric of utility and demand?
As illustrated in Figure 2, the paper considers six diverse and major representative NLP tasks — three user-facing tasks and three technical/linguistically focussed tasks.
Figure 2: NLP tasks evaluated in the study.
Figure 3 below demonstrates the linguistic and demographic global utility metrics (Mτ) for a number of language technology tasks as defined above.
Figure 3: Linguistic and Demographic global utility metrics for aforementioned NLP tasks.
Most NLP tasks clearly score substantially higher when utility is measured demographically rather than linguistically. Text-to-speech synthesis has the most linguistic coverage: a single study (Black, 2019) covers more than 630 languages (about 10% of the world’s languages). However, Ren et al. (2021) find that for the vast majority of these languages the measured quality of the generated speech is about half as good as that of the exceptionally good English system. Morphological inflection and dependency parsing come next in coverage, having been evaluated over 140 and 90 languages, respectively. Natural Language Inference (NLI) and Question Answering (QA) focus on up to 15 and 17 languages, respectively, leading to very low scores on the linguistic axis.
Figure 4: Recent historical progression of two language technology tasks: Inflection and Machine Translation from English.
Next, Figure 4 shows the progress of the utility metrics over the last 7 years for tasks with comparable data. The extensive efforts of the UniMorph project (Kirov et al., 2018) to cover as many languages as possible are visible in the “inflection” plot, with significant improvements over time. Machine translation, on the other hand, has improved its demographic coverage over the years but is still in the process of ramping up linguistic utility.
We refer the reader to Blasi et al. (2021) for deeper breakdowns of how well these tasks serve every language and every population.
Can we use these findings to discover which languages will lead to the largest global utility improvement?
The relative importance of linguistic vs. demographic demands can determine a priority ranking, as observed in Figure 5 for a sample of five tasks: Speech Synthesis, Morphological Inflection, Syntactic Analysis, Machine Translation from English, and Machine Translation to English.
Figure 5: The priority languages (top-3) change with the different balancing of Demographic and Linguistic utility.
Improving on the demographic-focused utility entails a greater emphasis on Mandarin Chinese (cmn), Hindi (hin), Spanish (spa), and other populous languages that are generally well-served by current technologies.
Balancing linguistic and demographic considerations leads to prioritizing a more diverse set of languages, mostly Asian and African languages like Amharic (amh), Bambara (bam), Bengali (ben), Thai (tha), or Yoruba (yor), which are both populous and under-served, along with large but severely under-served languages like Kurdish, Urdu, and Oromo. Further emphasis on linguistic utility would lead to the prioritization of indigenous and potentially endangered languages of small communities like Aimele, Itelmen, North Sami, or Warlpiri, which are currently largely ignored by NLP research.
Overall, this allows us to approximate how well technology is serving potential users throughout the world. It also allows us to identify “pain points” — languages that seem to be most underserved, based on our priorities with respect to equity of language or population coverage.
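The ranking idea can be sketched in a few lines of Python. This is a simplification under our own assumptions, not the paper's procedure: each language is scored by how much the global metric would rise if its utility were lifted to 1, which is its demand weight times its remaining utility gap. The function name, speaker counts, and utility values are all hypothetical.

```python
def priority_ranking(utility, speakers, tau, top_k=3):
    """Rank languages by the gain in M(tau) if their utility were raised to 1."""
    weights = {lang: n ** tau for lang, n in speakers.items()}
    total = sum(weights.values())
    # Gain = demand weight * (1 - current utility); missing languages have utility 0.
    gain = {
        lang: weights[lang] / total * (1.0 - utility.get(lang, 0.0))
        for lang in speakers
    }
    return sorted(gain, key=gain.get, reverse=True)[:top_k]


# Toy data (hypothetical speaker counts in millions and utilities in [0, 1]).
speakers = {"cmn": 900.0, "hin": 600.0, "spa": 500.0, "amh": 30.0, "sme": 0.02}
utility = {"cmn": 0.8, "hin": 0.6, "spa": 0.85, "amh": 0.1}

print(priority_ranking(utility, speakers, tau=1))  # ['hin', 'cmn', 'spa']
print(priority_ranking(utility, speakers, tau=0))  # ['sme', 'amh', 'hin']
```

Even on invented numbers the qualitative pattern matches the text: a demographic weighting favors populous languages, while a linguistic weighting surfaces small, unserved ones.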
Turning to the second paper (Joshi et al., 2020), we are curious:
How has the fate of different languages changed with current language technologies?
First, how many resources are available across the world’s languages and do they correlate to the number of speakers?
To get an idea of the digital resource status of different languages, a taxonomy is built from two features, which are plotted in Figure 6. The y-axis shows labeled data from the LDC and ELRA catalogs, and the x-axis shows unlabeled data from Wikipedia pages (used for many pre-trained language models).
Figure 6: Language Resource Distribution and examples from each language class.
Overall, six classes are formed, as seen in Figure 6. The spectrum from violet to red corresponds to the total speaker population size and the size of the circle indicates the number of languages in each class.
One of the key findings is that a large percentage of speakers have very limited access to language technologies. This not only creates a technological and communication barrier; more seriously, it can contribute to the extinction of the language.
Figure 7: Number of languages, number of speakers, and percentage of total languages for each language class.
The next question is,
How inclusive have NLP conferences been in conducting and publishing research for different languages?
After calculating an entropy score for the distribution of languages represented in each year's papers at a set of major NLP conferences, it is observed that LREC (Conference on Language Resources and Evaluation) and workshop proceedings are particularly inclusive of different languages. Another observation is that across most conferences there is a spike in entropy score, which might be attributed to the birth of massively multilingual models.
Figure 8: Language occurrence entropy over the years for different NLP conferences.
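As a rough illustration of what such a score captures, the sketch below computes a plain Shannon entropy over language mentions. The exact formulation in Joshi et al. (2020) may differ; the function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def language_entropy(language_mentions):
    """Shannon entropy (in bits) of the distribution of languages mentioned."""
    counts = Counter(language_mentions)
    total = sum(counts.values())
    # Higher entropy means mentions are spread more evenly over more languages.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data: language mentions gathered from one venue's papers in one year.
english_only = ["eng", "eng", "eng", "eng"]
diverse = ["eng", "hin", "yor", "tha"]

print(language_entropy(english_only), language_entropy(diverse))
```

A venue whose papers only mention English scores 0, and one that spreads its attention evenly over four languages scores 2 bits, so a rising curve means research is covering a broader mix of languages.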
We would like to conclude by urging the NLP community to take a more fine-grained look at these disparities and at how everyone can contribute towards mitigating them. Towards the end of your research, it is essential to ask:
Do your methods and experiments apply (or scale) to a wide range of languages? Are your findings and contributions supporting the inclusivity of various languages?
To further continue this discussion, let us connect!
Slack Community: https://join.slack.com/t/neuralspacecommunity/shared_invite/zt-xlj1xr8k-GQrOkp7tRIV9IuI_0GWS7Q