Tenzin Singhay Bhotia
Challenges in using NLP for low-resource languages and how NeuralSpace solves them
The interest in Natural Language Processing (NLP) systems has grown significantly over the past few years, and software products with NLP features are estimated to generate USD 48 billion globally by 2026. However, current NLP solutions focus mainly on a few high-resource languages like English, Spanish, or German, even though there are about 3 billion speakers of low-resource languages (mainly in Asia and Africa). Such a large portion of the world's population is still underserved by NLP systems because of the various challenges developers face when building NLP systems for low-resource languages. In this article, we briefly describe these challenges and outline how we at NeuralSpace are tackling them.
Challenges for low-resource languages
Lack of annotated datasets: Annotated datasets are necessary to train Machine Learning (ML) models in a supervised fashion. Such models are commonly used to solve specific tasks, like hate speech detection, very accurately. However, creating annotated datasets requires humans to label training examples one by one, which makes the process time-consuming and very expensive given the thousands of examples that advanced deep learning models require. Relying on manual data creation alone is therefore infeasible in the long run.
Lack of unlabeled datasets: Unlabeled datasets like text corpora are the precursors to their annotated versions. They are essential for training base models that are later fine-tuned for specific tasks. Hence, approaches to circumvent the lack of unlabeled datasets also become very important.
Supporting multiple dialects of a language: Languages with multiple dialects are also a tricky problem, especially for speech models. A model trained on one variety of a language usually does not perform well on its other dialects. For example, most unlabeled and annotated datasets available for Arabic are in Modern Standard Arabic, which is too formal for many Arabic speakers who want voice or chat assistants to feel natural in daily use. Thus, supporting dialects becomes necessary for practical use cases.
The list of challenges grows with every low-resource language, and even large corporations with NLP Software as a Service (SaaS) offerings, such as Google Dialogflow, AWS Lex, or Microsoft LUIS, understandably support only a small number of low-resource languages.
Overcoming the challenges
At NeuralSpace, we have taken on these challenges to solve them once and for all. This is no small endeavor, and we must deploy both the latest published research and our own proprietary methods to succeed in our mission to break down language barriers for everyone. In the following sections, we describe the most relevant techniques and methods we have been using and explain why they provide advantages when working with low-resource languages.
Transfer learning is a way of solving new tasks by leveraging prior knowledge in combination with new information. It is a common phenomenon in humans. For example, an athlete is much more likely than an individual with no athletic background to succeed at a physical sport new to both. More importantly, the athlete will likely need fewer resources (time) to learn the new sport.
At NeuralSpace, we build our models on top of base language models, which are like general athletes who can adapt to a new sport even in low-resource settings (the NeuralSpace athletes need less time to learn any new sport). Base language models do not require annotated data; they learn generic language capabilities through self-supervised training on raw text. Off-the-shelf, however, they are not very useful for specific tasks like classifying user intents. So we fine-tune these base language models to accurately solve user-specific tasks for which typically only very small amounts of annotated data exist. Through transfer learning, our models thus learn to solve these tasks accurately despite the scarcity of annotated data.
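The idea can be sketched in a few lines of plain Python. This is a toy illustration, not NeuralSpace's actual models: a frozen "pretrained" embedding table stands in for the base language model, and "fine-tuning" reduces to computing one centroid per intent from a handful of labelled examples. All vectors, words, and helper names here are invented.

```python
import math

# Pretend these word vectors came from self-supervised pre-training on raw text.
# They are frozen: fine-tuning below never changes them.
PRETRAINED = {
    "refund":  [0.9, 0.1],
    "money":   [0.8, 0.2],
    "back":    [0.7, 0.3],
    "where":   [0.1, 0.9],
    "order":   [0.2, 0.8],
    "package": [0.1, 0.8],
}

def embed(sentence):
    """Average the pretrained vectors of known words (frozen features)."""
    vecs = [PRETRAINED[w] for w in sentence.lower().split() if w in PRETRAINED]
    return [sum(d) / len(vecs) for d in zip(*vecs)]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def fit(examples):
    """'Fine-tune': one centroid per intent; the pretrained features do the heavy lifting."""
    grouped = {}
    for text, intent in examples:
        grouped.setdefault(intent, []).append(embed(text))
    return {i: [sum(d) / len(v) for d in zip(*v)] for i, v in grouped.items()}

def predict(centroids, text):
    v = embed(text)
    return max(centroids, key=lambda i: cosine(centroids[i], v))

# Only two labelled examples per the low-resource setting.
train = [("I want my money back", "request_refund"),
         ("where is my order", "check_order_status")]
model = fit(train)
print(predict(model, "refund my money"))      # request_refund
print(predict(model, "where is my package"))  # check_order_status
```

Because the generic language knowledge lives in the frozen features, only a very small task-specific component has to be learned from the annotated data, which is exactly why transfer learning helps when labels are scarce.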
To illustrate, let’s take the case of an e-commerce chatbot. The chatbot is supposed to answer queries and resolve customer issues around delivery times, refunds, and product specifications. To hold a conversation, the chatbot must first understand the intent of the customer and then extract a few entities. For example, “Where is my Razer Blade 14 that I ordered on the 4th of December?” has the intent “check order status” with the entities “laptop”: “Razer Blade 14” and “date”: “4th of December”. Thus, we need a simple intent classification model.
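To make the expected output concrete, here is a tiny rule-based stand-in for such a model. It only mimics the output format a trained intent and entity model would produce; the keyword rules, entity list, and function names are invented for illustration.

```python
import re

# Invented keyword rules standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "check_order_status": ["where is my", "track"],
    "request_refund": ["refund", "money back"],
}

KNOWN_LAPTOPS = ["Razer Blade 14", "MacBook Pro"]
DATE_PATTERN = re.compile(r"\b\d{1,2}(?:st|nd|rd|th) of \w+\b")

def parse(utterance):
    """Return the intent and entities for one customer utterance."""
    text = utterance.lower()
    intent = next((i for i, kws in INTENT_KEYWORDS.items()
                   if any(k in text for k in kws)), "fallback")
    entities = {}
    for laptop in KNOWN_LAPTOPS:
        if laptop.lower() in text:
            entities["laptop"] = laptop
    match = DATE_PATTERN.search(utterance)
    if match:
        entities["date"] = match.group()
    return {"intent": intent, "entities": entities}

print(parse("Where is my Razer Blade 14 that I ordered on the 4th of December?"))
# {'intent': 'check_order_status',
#  'entities': {'laptop': 'Razer Blade 14', 'date': '4th of December'}}
```

A real model replaces the keyword and regex rules with learned classifiers, but the interface, an utterance in and an intent plus entities out, stays the same.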
But the problem gets complex in real-world settings, where there can be hundreds of intent categories and support is needed for different low-resource languages. Annotating huge training datasets is very expensive, and training a well-performing model from scratch is time-consuming. However, because we train models in a transfer learning paradigm, our fine-tuned models perform much more accurately with fewer instances per intent category. Developers therefore save on data annotation costs and still gain model performance by using the transfer learning implemented in the NeuralSpace Platform. The best part: developers do not even need to think about transfer learning, because NeuralSpace’s AutoNLP takes care of it by itself.
Multilingual learning is a technique where a single model is trained on multiple languages. The assumption is that the model will learn similar representations for words and sentences with similar meanings across languages. This also enables cross-lingual transfer learning, as knowledge from data in high-resource languages like English can transfer to the model’s representations of low-resource languages like Swahili. This way, base models can perform better on low-resource languages despite the lack of large text corpora.
From a production perspective, a single multilingual model that supports numerous languages is much easier to scale and requires less storage. On top of that, maintaining one multilingual model is also easier and allows one to quickly upgrade to model architectures with higher accuracies.
To illustrate, let’s take the case of multilingual hate speech detection, where a model must detect hate speech in several languages, both low- and high-resource. Through multilingual training, we make one model learn jointly on data from various languages, often 50 or more. This results in higher overall performance across the given languages and especially helps increase performance for low-resource languages. One of NeuralSpace’s winning solutions in the HASOC 2021 competition also used this multilingual training approach (reference).
Multilingual learning also empowers an NLP model to generalize and make predictions for languages it was not fine-tuned on. For example, if we fine-tune a model on English, Hindi, and Marathi data, we can also predict on Tamil with reasonably high accuracy, assuming the base language model was pre-trained on a very large set of languages including Tamil.
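This cross-lingual transfer can be sketched with a toy shared embedding space. The vectors below are hand-crafted so that translations of the same word sit close together, mimicking what a multilingual base model learns from raw text; the words are real English, Hindi, and Tamil sentiment words (romanised), but the vectors and setup are invented for illustration.

```python
import math

# A shared space covering English, Hindi, and Tamil. Tamil never appears
# in the fine-tuning data below, only in the pre-trained space.
SHARED = {
    "good":  [0.90, 0.10],  # English
    "accha": [0.88, 0.12],  # Hindi: "good"
    "nalla": [0.87, 0.13],  # Tamil: "good"
    "bad":   [0.10, 0.90],
    "bura":  [0.12, 0.88],  # Hindi: "bad"
    "ketta": [0.13, 0.87],  # Tamil: "bad"
}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

# Fine-tune a sentiment "classifier" on English and Hindi words only.
train = [("good", "positive"), ("accha", "positive"),
         ("bad", "negative"), ("bura", "negative")]
grouped = {}
for word, label in train:
    grouped.setdefault(label, []).append(SHARED[word])
centroids = {l: [sum(d) / len(v) for d in zip(*v)] for l, v in grouped.items()}

def predict(word):
    return max(centroids, key=lambda l: cosine(centroids[l], SHARED[word]))

# Zero-shot: the Tamil words were never fine-tuned on, yet they land near
# their English and Hindi translations in the shared space, so labels transfer.
print(predict("nalla"))  # positive
print(predict("ketta"))  # negative
```

The whole effect hinges on the shared space: if the base model had never seen Tamil during pre-training, the Tamil vectors would not land near their translations and the transfer would fail.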
As icing on the cake, hosting a multilingual model is also cheaper: you keep one model live for n languages instead of n monolingual models. In production, where request rates rise and fall, the scaling expenses of one multilingual model will always be at most those of n monolingual models.
Data augmentation is a data pre-processing strategy that automatically creates new data without collecting it explicitly. For instance, in a sentiment classification task, “Today is wonderful” can be altered to “Today is a great day”. This alteration automatically increases and possibly diversifies the training data. Importantly, the augmentation must not change the ground-truth label of any new instance, in this case “positive sentiment”. Unlike other data collection strategies, data augmentation is very cheap, fast, and usually does not require human involvement.
To illustrate, let’s take a Named Entity Recognition (NER) use case. Assume a dataset in a low-resource language, say Hindi, with only 20 annotated instances for each of five given intents. While the NeuralSpace Platform would still give you a model with fairly high accuracy, it is generally advisable to create more data to increase the robustness of your model. This can easily be achieved with NeuralSpace’s data augmentation application, called Augmentation. Various ways to augment data exist, such as synonym replacement and back-translation. Developers can experiment with these methods in NeuralSpace’s Data Studio. Figure 1 shows an example of synonym-based data augmentation: “John is visiting New York to host a conference on the 20th of December.” is augmented once, creating a new training data point: “John is attending New York to hold a convention on the 20th of December.”. Thus, by replacing words with their synonyms, you can increase the diversity of your dataset.
If we generate one augmented version of each instance, we already double the amount of data, making it easy to obtain 40 instances per intent without any manual data collection. Combining different data augmentation techniques, one can extrapolate this strategy to create much larger datasets.
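A minimal synonym-replacement augmenter can be written in a few lines. This is only a sketch of the principle (the actual Augmentation app is more sophisticated): the synonym table is hand-written for the Figure 1 example, and a seeded random generator keeps the output reproducible.

```python
import random

# Hand-written synonym table, sized to reproduce the Figure 1 example.
SYNONYMS = {
    "visiting": ["attending"],
    "host": ["hold"],
    "conference": ["convention"],
}

def augment(sentence, rng=random.Random(0)):
    """Replace each known word with one of its synonyms; the label is untouched."""
    out = []
    for word in sentence.split():
        out.append(rng.choice(SYNONYMS[word]) if word in SYNONYMS else word)
    return " ".join(out)

original = "John is visiting New York to host a conference on the 20th of December."
print(augment(original))
# John is attending New York to hold a convention on the 20th of December.
```

Note that for NER specifically, a real augmenter must also keep the annotated entity spans aligned after replacement, since swapping a word can shift the character offsets of every entity that follows it.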
Altogether, it is difficult to build models for low-resource languages, mainly due to the lack of annotated and, in some cases, even unlabeled data. This leaves us with two options: collect more data, or improve our modelling techniques to get more from less. Manual data collection is reliable but usually costly. Thus, the effective modelling techniques we described above become all the more important.
The problems with low-resource languages are still far from being solved. However, we at NeuralSpace are taking meaningful steps in that direction.
Check out our Documentation for all the Apps and features of the NeuralSpace Platform. Join the NeuralSpace Slack Community to receive updates and discuss NLP for low-resource languages with fellow developers. Read more about the NeuralSpace Platform on neuralspace.ai.