VoiceAI: Record Accuracy in Hindi Speech-to-Text

Aditya Dalmia

We're excited to announce that VoiceAI has not only set new standards in Arabic speech-to-text (STT) accuracy but also in Hindi. In these tests, NeuralSpace's STT model surpassed seven other providers, with 138% better performance than OpenAI and 77% better than Google. On average, this means NeuralSpace transcriptions contain 2.4 times fewer errors than OpenAI's and 1.8 times fewer than Google's.

Setting a new standard in Hindi speech recognition

The recent benchmarking of our Hindi model marks not just a new industry benchmark in language AI, but also our relentless pursuit of improving on our own innovations. What truly differentiates our model is its accuracy: a remarkable 15% relative improvement over our preceding model (as detailed in Table 1).

With over 25,000 hours of audio data, our STT model has been meticulously trained on diverse voices, from people of different ages, genders, accents, and dialects, with varying sound quality. This robust training, complemented by human validation of AI-generated transcriptions, has resulted in a model that outperformed all other vendors. 

In our analysis, we found that our STT technology excels even in the most challenging scenarios, such as speech muffled by intense background noise or compromised by microphone disturbances, a common hurdle in call center environments.

With this milestone achievement in Hindi STT accuracy, we’re excited to announce the launch of new speech analytics features in VoiceAI, built on the foundation of our trailblazing speech recognition technology. Meaningful insights begin with precise transcription. Together, these technologies are helping us make significant progress towards our mission of dismantling language barriers in technology.

Table 1: NeuralSpace Hindi speech-to-text average word error rate (WER). Calculation: Change in WER / Previous WER
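The calculation in the table note, and the vendor comparisons quoted earlier, can be sketched in a few lines of Python. This is a minimal illustration; the numeric inputs in the example call are hypothetical placeholders, not the published benchmark figures.

```python
def relative_improvement(previous_wer: float, new_wer: float) -> float:
    """Relative improvement: change in WER divided by the previous WER."""
    return (previous_wer - new_wer) / previous_wer

def relative_advantage(vendor_wer: float, our_wer: float) -> float:
    """How much higher a vendor's WER is, relative to ours
    (e.g. 1.38 means the vendor makes 138% more errors)."""
    return (vendor_wer - our_wer) / our_wer

# Hypothetical example values, for illustration only:
print(f"{relative_improvement(20.0, 17.0):.0%}")  # prints "15%"
```

A "2.4 times fewer errors" claim is the same comparison expressed as a ratio: the vendor's WER divided by ours.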

Benchmarking Methodology

We used the most common method to test the accuracy of speech-to-text (STT) systems, which is Word Error Rate (WER)*. This metric determines the percentage of words in the STT output that differ from the actual, 100% accurate “ground truth” transcription. The WER is calculated by dividing the total number of errors, which includes substitutions, deletions, and insertions, by the total number of words in the ground truth transcription.

WER Calculation

A lower WER indicates a higher accuracy of the STT system. See how even a marginal difference in WER can impact the quality of your transcription.
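The definition above can be sketched in a few lines of Python, using word-level edit distance to count substitutions, deletions, and insertions. This is a minimal, illustrative implementation with made-up example sentences, not our production scoring pipeline.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sit") and one deletion ("the") over 6 reference words:
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # prints "33.33%"
```

Note how just two errors in a six-word sentence already produce a WER above 33, which is why small WER differences translate into very visible quality differences.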

Table 2: The comparison text for STT providers shows how the recognized output compares to the reference. Words in red indicate errors, with substitutions in italics, deletions crossed out, and insertions marked by an underscore. (Transcription 1 Audio; Transcription 2 Audio)

Test dataset

To ensure a comprehensive evaluation, we calculated the word error rate (WER) across 5 diverse datasets using 2,000 randomly selected audio samples.

Table 3: Test dataset descriptions


For the following Hindi STT benchmark, we used WER as the metric across the selected datasets. NeuralSpace's model achieves the lowest WER (highest accuracy), outperforming seven vendors with 138% better performance than OpenAI and 77% better than Google.

Lower WER is better

Dataset Benchmarking Results

Across the datasets, NeuralSpace consistently ranks among the top performers. Our Hindi STT model achieves an average WER of 18.05, showing its general effectiveness across multiple audio contexts.

Table 4: WER for different providers and datasets
Lower WER is better

Dataset Diversity: From open-source Common Voice (CV11) entries to technical MUCS lectures and Shrutilipi news bulletins, our datasets span a wide audio spectrum. This diversity ensures robust testing of speech-to-text models across various audio qualities and dialects.

Superiority on Shrutilipi: NeuralSpace's performance on the Shrutilipi dataset stands out with a WER of 10.47. This dataset, derived from All India Radio news bulletins, emphasizes the model's capability in understanding and transcribing formal Hindi speech, a crucial feature for professional applications.

Competitive Edge on MUCS: On the MUCS dataset, which comprises technical lectures, NeuralSpace scores a WER of 24.07. This is noteworthy since technical lectures often contain domain-specific terminologies, which can be challenging to transcribe. 

Robustness on Gram Vaani: The Gram Vaani dataset, containing telephone-quality speech, presents unique challenges due to its audio quality and diverse regional accents. NeuralSpace's WER of 26.77 is commendable given the inherent difficulties of this dataset. Google, one of its major competitors, records a WER of 56.77 on the same dataset, a difference of 30.00 in favor of NeuralSpace.

Universal Language Contribution (ULCA): On the ULCA dataset, which contains audios from varied sources like government TV and radio channels, NeuralSpace achieves a WER of 12.48. Azure, a close competitor, has a score of 27.75, establishing a difference of 15.27.

Leverage accuracy for advanced audio insights 

High-quality data insights start with accurate transcriptions. With the VoiceAI platform, you can generate audio insights to analyze post-call data, spot trends in your business, and track agent performance. With accurate STT combined with advanced analytics and translation capabilities, your business and customers get the best solution on the market.

Check out our website to learn more, or sign up for the VoiceAI platform to try our speech-to-text service for free!


*Even though WER is the most common metric for evaluating STT vendors, it can vary a lot depending on testing methodologies and test sets. Hence, WER should be interpreted in relative terms. Read more

**Tests conducted in October 2023, comparing NeuralSpace VoiceAI's upgraded STT model against Google, Azure, AWS, OpenAI, Deepgram, Speechmatics, and SymblAI.

