Comparing NeuralSpace with Hugging Face on Indic languages
With more than 14,000 open-source models in 150 languages on its model hub and nearly 500 publicly available datasets, Hugging Face is by any measure the most vibrant Natural Language Processing (NLP) community worldwide. AWS, the Allen Institute for AI, Microsoft, and Google AI are among its well-known users, and outstanding solutions have been built with Hugging Face. However, the community has not been very active in languages other than high-resource European ones, with a particularly strong focus on English. Hugging Face’s model hub, for example, hosts 15 times more models in English than in the community’s second most popular language, Spanish, and only about 50 models have been trained for Hindi, one of India’s official languages, compared to the almost 6,000 models that can process English (source).
The purpose of this article is to compare how Hugging Face’s models perform against the ones built by NeuralSpace, a London-based NLP startup focusing on low-resource local languages. The comparison is by no means comprehensive but covers both companies’ Natural Language Understanding (NLU) capabilities in local languages spoken across India.
To make the assessment as objective as possible, we looked for a collection of datasets spanning news headlines in Tamil, Gujarati, and Kannada; motivational comments on social media platforms in Tamil and Malayalam; and aggressive content on Facebook in Hinglish (a mix of Hindi and English written in the Latin alphabet).
Our results show that Hugging Face’s models often struggled with Indic languages, especially when classifying news headlines in Telugu, Kannada, and Gujarati. In contrast, NeuralSpace’s NLU models achieved accuracies of 80–95% in these three languages, an improvement of 20–30 percentage points over Hugging Face. In Hinglish, neither Hugging Face nor NeuralSpace reached accuracies of that level, but NeuralSpace still held a 5-percentage-point advantage (61% vs. 56%). On the motivational comments in Tamil and Malayalam, both companies achieved comparable results. The comparison used both companies’ AutoNLP models; on Hugging Face, English was chosen as the training language because the low-resource Indic languages were not available. See the side-by-side comparison in the following table for the exact numbers.
The results for NeuralSpace were achieved with the NeuralSpace Platform, which includes multiple NLP-specific apps such as Translation, Transliteration, Speaker Identification, Speech-to-Text, Text-to-Speech, and, the one used for these experiments, Language Understanding, with more to come. The philosophy behind the platform is that every language-related business problem, whether in text or audio, can be broken down into sub-problems, each solved by one specific app. The apps are structured so that they can easily be plugged together, one after the other, to build NLP pipelines. For example, a voice assistant like Amazon Alexa requires a pipeline consisting of Speech-to-Text, Language Understanding, and Text-to-Speech, potentially also including Speaker Identification or Transliteration. On the NeuralSpace Platform, these building blocks can be combined in drag-and-drop fashion in the online graphical user interface (GUI), via simple APIs, or through an intuitive CLI. Users do not need to understand the underlying complexities of the state-of-the-art deep learning models powering NeuralSpace’s apps; they can simply click the Train button for AutoNLP-powered model training.
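To make the plug-apps-together idea concrete, here is a minimal sketch of such a voice-assistant pipeline in Python. The app names mirror the ones mentioned above, but every function below is a hypothetical stand-in with dummy outputs, not the NeuralSpace Platform’s actual API.

```python
# Sketch of chaining apps into an NLP pipeline, as described above.
# All three functions are hypothetical stand-ins: a real deployment would
# call the platform's Speech-to-Text, Language Understanding, and
# Text-to-Speech services instead of returning hard-coded values.

def speech_to_text(audio: bytes) -> str:
    # Stand-in: a real app would transcribe the audio.
    return "turn on the lights"

def language_understanding(text: str) -> dict:
    # Stand-in: a real NLU app would classify the intent and extract entities.
    return {"intent": "device_control", "entities": {"device": "lights"}}

def text_to_speech(reply: str) -> bytes:
    # Stand-in: a real app would synthesize audio from the reply text.
    return reply.encode("utf-8")

def voice_assistant_pipeline(audio: bytes) -> bytes:
    """Plug the apps together, one after the other."""
    text = speech_to_text(audio)
    nlu = language_understanding(text)
    reply = f"Okay, handling intent '{nlu['intent']}'"
    return text_to_speech(reply)

print(voice_assistant_pipeline(b"...raw audio..."))
```

Swapping one stage for another (say, adding Transliteration before Language Understanding for Hinglish input) only means inserting one more function in the chain, which is the point of the building-block design.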
The NeuralSpace Platform also includes a Data Studio where users can prepare their data and then train models with AutoNLP by clicking the Train button in the top right corner.
Looking ahead, NeuralSpace aims to become the leader in NLP for low-resource languages spoken across Africa, the Middle East, and Asia. In these regions, local languages are spoken by more than three billion people, and there is an ever-increasing need for technology that supports them natively.
Stay tuned for more benchmarking data! For any questions, feel free to reach out to me on Twitter.