VoiceAI: Record Accuracy in Hindi Speech-to-Text

Aditya Dalmia

We're excited to announce that VoiceAI has set new standards in speech-to-text (STT) accuracy not only in Arabic but also in Hindi. In our Hindi benchmark tests, NeuralSpace's STT model surpassed seven other providers, with 138% better performance than OpenAI and 77% better performance than Google. This means that NeuralSpace transcriptions have, on average, 1.8 times fewer errors than Google's and 2.4 times fewer errors than OpenAI's.

Setting a new standard in Hindi speech recognition

The recent benchmarking of our Hindi model illustrates not just a new industry benchmark in language AI, but also our relentless pursuit of advancing our own innovations. What truly differentiates our model is its accuracy: it achieves a remarkable 15% relative improvement in WER over our preceding model (as detailed in Table 1).

With over 25,000 hours of audio data, our STT model has been meticulously trained on diverse voices, from people of different ages, genders, accents, and dialects, with varying sound quality. This robust training, complemented by human validation of AI-generated transcriptions, has resulted in a model that outperformed all other vendors. 

In our analysis, we found that our STT technology excels even in the most challenging scenarios, such as speech muffled by intense background noise or compromised by microphone disturbances - a common hurdle in call center environments. 

With this milestone achievement in Hindi STT accuracy, we’re excited to announce the launch of new speech analytics features in VoiceAI, built on the foundation of our trailblazing speech recognition technology. Meaningful insights begin with precise transcription. Together, these technologies are helping us make significant progress towards our mission of dismantling language barriers in technology.

Table 1: NeuralSpace Hindi speech-to-text average word error rate (WER). Calculation: Change in WER / Previous WER
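The calculation named in the Table 1 caption can be sketched as follows. This is a minimal illustration, not NeuralSpace's evaluation code: the function name is our own, and the two WER values are hypothetical placeholders chosen only to show the arithmetic, not the figures from Table 1.

```python
def relative_wer_improvement(previous_wer: float, new_wer: float) -> float:
    """Relative improvement = change in WER / previous WER."""
    if previous_wer <= 0:
        raise ValueError("previous_wer must be positive")
    return (previous_wer - new_wer) / previous_wer


# Hypothetical values for illustration only (NOT taken from Table 1).
example_previous_wer = 20.0
example_new_wer = 17.0

improvement = relative_wer_improvement(example_previous_wer, example_new_wer)
print(f"Relative improvement: {improvement:.0%}")
```

Because the change is divided by the previous WER, the result is a relative figure: a drop from 20 to 17 and a drop from 10 to 8.5 both count as the same relative improvement.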

Benchmarking Methodology

To test accuracy, we used Word Error Rate (WER)*, the most common metric for evaluating speech-to-text (STT) systems. This metric gives the percentage of words in the STT output that differ from the fully accurate “ground truth” transcription. WER is calculated by dividing the total number of errors (substitutions, deletions, and insertions) by the total number of words in the ground truth transcription.

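WER Calculation: WER = (S + D + I) / N, where S, D, and I are the numbers of substitutions, deletions, and insertions, and N is the total number of words in the ground truth transcription.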
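A minimal sketch of this calculation, assuming whitespace tokenization and no text normalization (the benchmark's actual preprocessing is not described here). It uses word-level edit distance to find the smallest number of substitutions, deletions, and insertions. The function name and the Hindi example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr_row[j] = min(
                prev_row[j] + 1,         # deletion (reference word missing)
                curr_row[j - 1] + 1,     # insertion (extra hypothesis word)
                prev_row[j - 1] + cost,  # substitution, or match if cost is 0
            )
        prev_row = curr_row
    return prev_row[-1] / len(ref_words)


# Illustrative example: one substituted word in a five-word reference.
reference_text = "मैं घर जा रहा हूँ"
hypothesis_text = "मैं घर जा रही हूँ"
example_wer = word_error_rate(reference_text, hypothesis_text)
print(f"WER: {example_wer:.0%}")
```

Since insertions count as errors while the denominator is the reference length, WER can exceed 100% when the output contains many extra words.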

A lower WER indicates a higher accuracy of the STT system. See how even a marginal difference in WER can impact the quality of your transcription.

Table 2: Comparison of each STT provider's recognized output against the reference transcription. Errors appear in red: substitutions in italics, deletions crossed out, and insertions marked with an underscore. (Transcription 1 Audio; Transcription 2 Audio)

Test dataset

To ensure a comprehensive evaluation, we calculated the word error rate (WER) across 5 diverse datasets using 2,000 randomly selected audio samples.

Table 3: Test dataset descriptions

Results

For the following Hindi STT benchmark, we used WER as the metric across the selected datasets. NeuralSpace’s model achieves the lowest WER (highest accuracy), outperforming seven vendors, with 138% better performance than OpenAI and 77% better performance than Google.
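These percentages line up with the "1.8 times" and "2.4 times" figures in the introduction if "X% better" is read as "the other provider's average WER is X% higher than NeuralSpace's". The post does not spell out that definition, so treat it as an assumption; the function name below is our own.

```python
def error_ratio_from_percent_better(percent_better: float) -> float:
    """Assumed definition: percent_better = (other_wer - our_wer) / our_wer * 100.

    Under that assumption, other_wer / our_wer = 1 + percent_better / 100.
    """
    return 1.0 + percent_better / 100.0


# Percentages quoted in the text.
quoted_percent_better = {"OpenAI": 138.0, "Google": 77.0}

error_ratios = {
    vendor: round(error_ratio_from_percent_better(pct), 1)
    for vendor, pct in quoted_percent_better.items()
}
print(error_ratios)
```

Under this reading, 138% corresponds to roughly 2.4 times as many errors and 77% to roughly 1.8 times as many.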

Lower WER is better

Dataset Benchmarking Results

Across the datasets, NeuralSpace consistently ranks among the top performers. Our Hindi STT model achieves an average WER of 18.05, showing its general effectiveness across multiple audio contexts.

Table 4: WER for different providers and datasets
Lower WER is better

Dataset Diversity: From open-source Common Voice (CV11) entries to technical MUCS lectures and Shrutilipi news bulletins, our datasets span a wide audio spectrum. This diversity ensures robust testing of speech-to-text models across various audio qualities and dialects.

Superiority on Shrutilipi: NeuralSpace's performance on the Shrutilipi dataset stands out with a WER of 10.47. This dataset, derived from All India Radio news bulletins, emphasizes the model's capability in understanding and transcribing formal Hindi speech, a crucial feature for professional applications.

Competitive Edge on MUCS: On the MUCS dataset, which comprises technical lectures, NeuralSpace scores a WER of 24.07. This is noteworthy since technical lectures often contain domain-specific terminologies, which can be challenging to transcribe. 

Robustness on Gram Vaani: The Gram Vaani dataset, containing telephone-quality speech, presents unique challenges due to its audio quality and diverse regional accents. NeuralSpace's WER of 26.77 is commendable given the inherent difficulties of this dataset. Google, one of its major competitors, records a WER of 56.77 on the same dataset, a difference of about 30 points in favor of NeuralSpace.

Universal Language Contribution API (ULCA): On the ULCA dataset, which contains audio from varied sources like government TV and radio channels, NeuralSpace achieves a WER of 12.48. Azure, a close competitor, has a score of 27.75, a difference of 15.27.
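The per-dataset gaps can be recomputed from the WER values quoted above. This sketch uses only those rounded figures, so the last digit can differ from a gap derived from unrounded scores. The dictionary layout and function name are illustrative.

```python
# WER values quoted in the text for the two head-to-head comparisons.
reported_wer = {
    "Gram Vaani": {"NeuralSpace": 26.77, "Google": 56.77},
    "ULCA": {"NeuralSpace": 12.48, "Azure": 27.75},
}


def wer_gaps(scores: dict, baseline: str = "NeuralSpace") -> dict:
    """Absolute WER gap (in points) between each vendor and the baseline."""
    baseline_wer = scores[baseline]
    return {
        vendor: round(wer - baseline_wer, 2)
        for vendor, wer in scores.items()
        if vendor != baseline
    }


gaps_by_dataset = {name: wer_gaps(scores) for name, scores in reported_wer.items()}
for name, gaps in gaps_by_dataset.items():
    for vendor, gap in gaps.items():
        print(f"{name}: {vendor} is {gap:.2f} WER points above NeuralSpace")
```

These are absolute differences in WER points, which is a different measure from the relative percentages used in the headline comparison.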

Leverage accuracy for advanced audio insights 

High-quality data insights start with accurate transcriptions. With the VoiceAI platform, you can generate audio insights to analyze post-call data, spot trends in your business, and track agent performance. With accurate STT combined with advanced analytics and translation capabilities, your business and customers get the best solution in the market.

Sign up to VoiceAI to try it for free.

Contact our sales team with any questions about our enterprise pricing and bespoke solutions. We’re here to help.

Footnotes

*Even though WER is the most common metric for evaluating STT vendors, it can vary a lot depending on testing methodologies and test sets. Hence, WER should be interpreted in relative terms.

**Tests were conducted in October 2023, comparing NeuralSpace VoiceAI’s upgraded STT model against Google, Azure, AWS, OpenAI, Deepgram, Speechmatics, and SymblAI.
