  • Mehar Bhatia

Named Entity Recognition at your fingertips: A comparative study of various AutoNLP engines (Part 2)

We at NeuralSpace are back with another comparative study, this time covering Azure Cognitive Services and Google Cloud’s Dialogflow for Named Entity Recognition (NER) with automatic model training (called AutoNLP).

In Part 1, we performed a similar study for AutoNLP Entity Extraction services of Hugging Face and AWS Comprehend. Check out the previous article for more details.

In this article, we will assess the NeuralSpace, Azure Cognitive Services and Google Cloud’s DialogFlow AutoNLP platforms comprehensively for entity extraction in a multilingual setting.

Let us start with the number of supported languages.

  • NeuralSpace: 95 languages.

This includes low-resource languages from the Indian subcontinent, the Middle East, Asia and Africa.

List of supported languages:

  • Azure Cognitive Services: 92 languages.

Newly supported languages include Xhosa, Yiddish, Zulu, Thai, Filipino, Pashto and Esperanto!

List of supported languages:

  • Google Cloud’s Dialog Flow:

List of supported languages:

A user can also specify a locale as a specific region or country for a root language. We specify more about the overall user experience for entity recognition on this platform at the end of the article.

To maintain a fair comparison with our previous study, we use the same datasets, publicly available on Hugging Face. The links are mentioned below.























Any AutoNLP pipeline mainly consists of the following stages. Let us walk through each stage and compare NeuralSpace and Azure at each one.

Stage 1: Upload Data (Training/Testing examples)

On the NeuralSpace Platform, a user can easily upload their data in CSV, Rasa or NeuralSpace JSON format. As seen in the figure below, sample files are provided, along with commands for quickly converting datasets from various other formats. Easy-peasy: no documentation is even required here.

On the other hand, on Azure’s Cognitive Services platform, the dataset must be converted to the following JSON format, and no converter scripts are provided. Well, that is not at all convenient if the user does not have any coding experience! After that, one must create a new container under their Azure storage account to upload the data.

Another disadvantage is that a user must upload a minimum of 10 txt files (one example per line) as training data, in addition to the above JSON file. The JSON file maps each entity label to its occurrence in a txt file, identified by the corresponding line number.
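To give a feel for the amount of bookkeeping this involves, here is a minimal sketch of such a conversion. The function name, file names and JSON keys are all hypothetical and are not Azure's actual schema, which is defined in Azure's documentation. The sketch only illustrates the idea described above: examples go one per line into at least 10 text files, and a separate label index points back to each entity by file, line number and character span.

```python
import json


def build_upload_bundle(examples, min_files=10):
    """Split tagged examples into text files plus one label index.

    `examples` is a list of (text, entities) pairs, where each entity is a
    (start, end, label) triple of character offsets into `text`.
    The output layout is hypothetical, NOT Azure's real schema.
    """
    if len(examples) < min_files:
        raise ValueError(f"need at least {min_files} examples to fill {min_files} files")

    files = {f"train_{i:02d}.txt": [] for i in range(min_files)}
    names = list(files)
    labels = []
    for idx, (text, entities) in enumerate(examples):
        name = names[idx % min_files]       # spread examples round-robin
        files[name].append(text)            # one example per line
        line_number = len(files[name])      # 1-based line of this example
        for start, end, label in entities:
            labels.append({
                "file": name,
                "line": line_number,
                "offset": start,
                "length": end - start,
                "label": label,
            })

    documents = {name: "\n".join(lines) + "\n" for name, lines in files.items()}
    return documents, json.dumps(labels, ensure_ascii=False, indent=2)


# Twelve copies of one tagged sentence, just to exercise the function.
examples = [("I live in Delhi", [(10, 15, "CITY")])] * 12
documents, labels_json = build_upload_bundle(examples)
```

Here `documents` maps each file name to its contents and `labels_json` is the label index as a JSON string; writing them to disk and uploading them to a storage container is left out.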

Needless to say, this is a lot of effort, especially if you are using Azure as a cloud service for the first time. Thankfully, the documentation is comprehensive, but should something so simple really be that complicated?

When it comes to preparing the training and evaluation data, both platforms provide a similar workspace (NeuralSpace calls it Data Studio, Azure calls it Language Studio) where a user can easily add examples, tag entities, and remove or rename them. One advantage of Azure’s Language Studio is that one can upload an entire document and tag entities on the go, while with NeuralSpace, you must tag each sentence separately.

Below is a snippet of Azure Cognitive Service’s Language Studio:

Below is a snippet of NeuralSpace’s Data Studio

Stage 2: Train your model

Both platforms provide a simple process to start training models. A drawback of Azure’s platform is that a user can train only one model at a time, whereas the NeuralSpace platform allows a user to run up to five training jobs in parallel to extract the best model performance.

Snippet of NeuralSpace platform. Train your models in just a click!

The metrics presented on the two platforms to evaluate performance are also different. Azure’s Entity Extraction uses the traditional classification metrics for NER evaluation, i.e., micro-averaged precision, recall and F1-score.
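As a reminder of what micro-averaging means, here is a minimal sketch. The function name and the example counts are made up for illustration, and this is the textbook definition rather than a description of Azure's internal implementation: true positives, false positives and false negatives are summed over all entity types first, and precision, recall and F1 are computed once from the totals.

```python
def micro_scores(counts):
    """Micro-averaged precision, recall and F1.

    `counts` maps entity type -> (true_pos, false_pos, false_neg).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative counts only, not results from this study.
precision, recall, f1 = micro_scores({"PERSON": (8, 2, 0), "CITY": (2, 0, 2)})
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the counts are pooled, frequent entity types dominate the score, which is one reason a single micro F1 can hide weak performance on rare types.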

As mentioned in our prior blog post, at NeuralSpace we believe that these traditional metrics may not be the best way to evaluate and further improve an NER system. A named entity can be made up of multiple tokens, so a full-entity accuracy is desirable. Also, this simple scheme of calculating an F1-score ignores partial matches, as well as cases where the NER system gets the named-entity surface string correct but the type wrong.

For this reason, NeuralSpace reports two unique metrics for comparing performance: strict F1 score and partial F1 score. For more information on why these metrics are more comprehensive than macro and micro-averaging F1 scores, we would like to direct readers to David Batista’s blog post.
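The difference between the two scores can be sketched in a few lines. This is a simplified illustration in the spirit of the evaluation schemes covered in David Batista’s post, not NeuralSpace’s actual implementation; the function names, the greedy matching and the 0.5 credit for overlapping spans are assumptions made for this sketch. Entities are `(start, end, label)` triples. The strict score requires boundaries and type to match exactly, while the partial score ignores the type and gives half credit when the boundaries merely overlap.

```python
def overlaps(a, b):
    """True if two (start, end, ...) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def strict_and_partial_f1(gold, pred):
    def f1(score):
        precision = score / len(pred) if pred else 0.0
        recall = score / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Strict: exact boundaries AND exact entity type.
    strict = len(set(gold) & set(pred))

    # Partial: type is ignored; exact boundaries score 1.0, overlaps score 0.5.
    partial = 0.0
    unmatched = list(gold)
    for p in pred:
        exact = next((g for g in unmatched if g[:2] == p[:2]), None)
        loose = next((g for g in unmatched if overlaps(g, p)), None)
        match = exact or loose
        if match is not None:
            partial += 1.0 if exact else 0.5
            unmatched.remove(match)   # each gold entity is matched at most once

    return f1(strict), f1(partial)


# Made-up spans: one exact hit, one wrong type, one shifted boundary.
gold = [(0, 10, "PER"), (15, 20, "LOC"), (25, 30, "ORG")]
pred = [(0, 10, "PER"), (15, 20, "ORG"), (26, 30, "ORG")]
strict_f1, partial_f1 = strict_and_partial_f1(gold, pred)
print(f"strict F1={strict_f1:.3f} partial F1={partial_f1:.3f}")
```

In this toy example only the first prediction counts under the strict score, whereas the partial score also credits the wrongly typed and the shifted entity, so the gap between the two numbers tells you how much of the error comes from types and boundaries rather than from missed entities.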

Stage 3: Deploy your model

Deploying your model is a click away on either platform. One catch with Azure’s platform is that a user can deploy only a single model instance, with no mention of the RPS (requests per second) it can handle or the latency (response time) that a user can expect. The NeuralSpace Platform shows this information instantly, so that the user knows in advance what to expect from the model in terms of response time.

Snippet to deploy and scale model replicas using NeuralSpace platform.

Stage 4: Test your model

On NeuralSpace, a user can test a deployed model and extract relevant entities using the best trained model, with high confidence scores supporting each prediction.

On the other hand, we report a bug in the response provided by Azure’s test API.

As seen in the above two figures, the calculated length of the word is incorrect for Hindi and Bengali, i.e., the calculation does not take into consideration the vowel signs in Hindi and Bengali (also known as matras). Hence, the offset is also miscalculated. This causes further glitches for users who aim to integrate the API with their product or perform other post-processing tasks. Overall, this suggests that Azure’s Cognitive Services has not undergone a thorough quality check on these two low-resource languages.
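The effect is easy to reproduce locally. The sketch below does not call Azure’s API, and we do not know how the service computes its values internally; the helper names are made up for this example. It simply contrasts the length in Unicode code points with a length that skips combining marks such as matras, and shows how the second one yields offsets that no longer line up with the text.

```python
import unicodedata


def codepoint_length(text):
    """Length in Unicode code points, which is what Python slicing uses."""
    return len(text)


def base_length(text):
    """Length that skips combining marks (Unicode categories Mn, Mc, Me)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# "Hindi" in Devanagari: ha, vowel sign i, anusvara, da, vowel sign ii.
word = "\u0939\u093f\u0902\u0926\u0940"
# "bhasha" (language): bha, vowel sign aa, ssa, vowel sign aa.
second = "\u092d\u093e\u0937\u093e"
sentence = word + " " + second

correct_offset = codepoint_length(word + " ")   # start of the second word
wrong_offset = base_length(word + " ")          # what you get if matras are skipped

print(codepoint_length(word), base_length(word))
print(sentence[correct_offset:correct_offset + codepoint_length(second)] == second)
print(sentence[wrong_offset:wrong_offset + base_length(second)] == second)
```

With the mark-skipping length, both the reported length and the offset fall short, so slicing the original string with them returns fragments of the wrong word. That is exactly the kind of glitch that breaks downstream span-based scoring or highlighting.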

Stage 5: Improve your model

Both platforms follow the approach of Feedback Driven Learning, where a user has insight into the performance of each entity and therefore knows whether more data needs to be added to improve results.

Benchmarking Results

Since Strict and Partial F1 scores could not be calculated from Azure’s Test API due to incorrect length and offset calculation, we decided to calculate the micro F1 score from NeuralSpace model results in order to maintain a fair comparison.

Overall, despite having much less market experience than its competitors and a brand-new solution, NeuralSpace’s AutoNLP achieved higher or comparable results than Azure, an established provider, across most tested languages.

User Scenarios and associated pricing plans

Let us take the same scenario as in our preceding blog post. To reiterate, the scenario is as follows. Developers from a mid-sized chatbot company would like to use AutoNLP entity extraction to make their chatbots more intelligent in targeted domains like insurance and healthcare in a multilingual region. NER is one of the most important modules in building such conversational bots and dialogue systems.

Let us assume the following. The developers at the chatbot company:

  1. Want a throughput of 10 requests per second.

  2. Will make 500,000 API calls to parse user messages per month.

  3. Will train their AutoNLP multilingual NER using 5 training jobs.

We divide the pricing into three parts: i) training costs, ii) deployment costs and iii) inference costs using APIs.

i) Training costs:

In terms of training costs, NeuralSpace’s AutoNLP is currently priced at a fixed rate of $3 per training job while Azure Cognitive Services (similar to AWS Comprehend) charges $3 per hour of training.

ii) Deployment costs:

Since our chosen company wants a throughput of 10 requests per second, and NeuralSpace’s AutoNLP promises a throughput of 5 requests per second for each deployed replica, we needed to deploy 2 replicas of the model on their platform. One replica denotes 1 instance of the model. Deploying 1 replica costs USD 0.5 per day, making the total cost 0.5 X 30 (number of days) X 2 (number of replicas) = USD 30 per month. We approximate the cost to be the same for Azure, as there is no mention of RPS.

iii) Inference costs:

Since our scenario company expects to make 500,000 API calls to parse user messages in one month, and NeuralSpace’s AutoNLP charges USD 0.007 per request, the total cost on their platform would be 500,000 (number of API calls) X USD 0.007, which is USD 3,500 for one month.

For Azure Creator Studio, the cost is calculated as follows. Azure charges for inference at the rate of USD 1 per 1,000 text records. Hence the total cost to parse 500,000 API calls will be USD 500 (500,000/1,000).
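The arithmetic for this scenario can be reproduced with a short script. All rates are the ones quoted in this article; the variable names are our own, and `Decimal` is used only to avoid floating-point rounding in the currency amounts. Azure’s training cost is left out because it is billed per hour and the training duration is not part of the scenario.

```python
import math
from decimal import Decimal

# Scenario assumptions from the article.
TARGET_RPS = 10
API_CALLS_PER_MONTH = 500_000
TRAINING_JOBS = 5
DAYS_PER_MONTH = 30

# NeuralSpace rates quoted in the article (USD).
NS_PRICE_PER_TRAINING_JOB = Decimal("3")
NS_RPS_PER_REPLICA = 5
NS_PRICE_PER_REPLICA_DAY = Decimal("0.5")
NS_PRICE_PER_REQUEST = Decimal("0.007")

# Azure inference rate quoted in the article (USD per 1,000 text records).
AZURE_PRICE_PER_1000_RECORDS = Decimal("1")

# i) Training: fixed price per job on NeuralSpace.
neuralspace_training = NS_PRICE_PER_TRAINING_JOB * TRAINING_JOBS

# ii) Deployment: enough replicas to reach the target throughput.
replicas = math.ceil(TARGET_RPS / NS_RPS_PER_REPLICA)
deployment_cost = NS_PRICE_PER_REPLICA_DAY * DAYS_PER_MONTH * replicas

# iii) Inference: per-request vs. per-1,000-records pricing.
neuralspace_inference = NS_PRICE_PER_REQUEST * API_CALLS_PER_MONTH
azure_inference = AZURE_PRICE_PER_1000_RECORDS * API_CALLS_PER_MONTH / 1000

print(f"NeuralSpace training:   USD {neuralspace_training}")
print(f"Deployment ({replicas} replicas): USD {deployment_cost}")
print(f"NeuralSpace inference:  USD {neuralspace_inference}")
print(f"Azure inference:        USD {azure_inference}")
```

Changing `TARGET_RPS` or `API_CALLS_PER_MONTH` at the top is enough to rerun the comparison for a different workload.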

Overall user experience

While using all three platforms, we felt that the NeuralSpace no-code user interface and CLI were the easiest to use thanks to their well-presented documentation, user tutorials and explainer videos on YouTube. The seamless pricing calculator provided by NeuralSpace was also very convenient to use and instant updates helped the users to be more aware and take note of their current costs.

Google Cloud’s DialogFlow came with various difficulties and overall, it was much harder to use. DialogFlow has two types of editions:

1. DialogFlow Essentials (ES):

This is the standard edition of Dialogflow, and the one most people are familiar with. Within Dialogflow ES, you teach your agent to understand the intent of the user by providing training sentences manually. The agent is then trained to understand these sentences and create a reply for them. A user can also manually add entities, define synonyms and add regular expressions for entities (similar to the NeuralSpace platform). However, after you add your response, you can only test your model instantly as a chatbot in the interface. There is no mention of any performance metrics for the model. Our assumption is that DialogFlow does not use a state-of-the-art architecture in its backend, but rather a rule-based system or a simple machine learning algorithm. On the other hand, NeuralSpace’s Language Understanding app is driven by state-of-the-art deep learning architectures, providing the best models for your application.

2. DialogFlow Customer Experience (CX):

Dialogflow CX is a relatively new version of Dialogflow and is not yet used by many people. In general, Dialogflow CX is not recommended for many use cases, unless you have a very sophisticated chatbot.

Update to the previous blog: NeuralSpace now provides a data bulk upload feature that allows users to instantly add their datasets, along with various datasets in different domains and languages that users can import and directly use based on their use-case.

We hope that this blog has provided you with insights and will help you to choose the best language-agnostic AutoNLP entity extraction engine for your specific use case.

At NeuralSpace, we will be happy to connect with you if you would like a demo or have any questions or feedback to help us improve our product. We aim to provide a powerful resource to accelerate your pipelines and empower the next billion users to use the Internet in the language and mode of their choice. Together we can contribute to the engineering and research community.


The NeuralSpace Platform is live. Test it and try it out for yourself! Early sign-ups get $500 worth of credits. What are you waiting for?

Join the NeuralSpace Slack Community to connect with us. Also, receive updates and discuss topics in NLP for low-resource languages with fellow developers and researchers.

Check out our Documentation to read more about the NeuralSpace Platform and its different Apps.