We're excited to announce that VoiceAI has set a new standard not only in Arabic speech-to-text (STT) accuracy but also in Hindi. In these tests, NeuralSpace's STT model surpassed seven other providers, performing 138% better than OpenAI and 77% better than Google. This means that on average, NeuralSpace transcriptions have 1.8 times fewer errors than Google's and 2.4 times fewer errors than OpenAI's.
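The "times fewer errors" figures follow arithmetically from the percentage improvements. A minimal sketch of that conversion, assuming "X% better" means the competitor's WER is (1 + X/100) times ours:

```python
def times_fewer_errors(percent_better: float) -> float:
    """Convert 'X% better performance' into an error-count ratio.

    Assumption: 'X% better' means the competitor's word error rate is
    (1 + X/100) times ours, so our output has that many times fewer errors.
    """
    return 1 + percent_better / 100

print(round(times_fewer_errors(138), 1))  # OpenAI comparison -> 2.4
print(round(times_fewer_errors(77), 1))   # Google comparison -> 1.8
```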
The recent benchmarking of our Hindi model illustrates not just the establishment of a new industry benchmark in language AI, but also our relentless pursuit of advancing our own innovations. What truly differentiates our model is its accuracy: it achieves a remarkable 15% relative improvement over our preceding model (as detailed in Table 1).
With over 25,000 hours of audio data, our STT model has been meticulously trained on diverse voices from people of different ages, genders, accents, and dialects, with varying sound quality. This robust training, complemented by human validation of AI-generated transcriptions, has resulted in a model that outperformed all other vendors.
In our analysis, we found that our STT technology excels even in the most challenging scenarios, such as speech muffled by intense background noise or compromised by microphone disturbances, a common hurdle in call center environments.
With this milestone achievement in Hindi STT accuracy, we’re excited to announce the launch of new speech analytics features in VoiceAI, built on the foundation of our trailblazing speech recognition technology. Meaningful insights begin with precise transcription. Together, these technologies are helping us make significant progress towards our mission of dismantling language barriers in technology.
We used the most common method to test the accuracy of speech-to-text (STT) systems, which is Word Error Rate (WER)*. This metric determines the percentage of words in the STT output that differ from the actual, 100% accurate “ground truth” transcription. The WER is calculated by dividing the total number of errors, which includes substitutions, deletions, and insertions, by the total number of words in the ground truth transcription.
A lower WER indicates a higher accuracy of the STT system. See how even a marginal difference in WER can impact the quality of your transcription.
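The WER calculation described above can be sketched as a word-level edit distance. A minimal implementation, using standard dynamic programming (the example sentences are illustrative, not from our test sets):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of words in the ground-truth reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution + one deletion against a five-word reference -> WER 0.4
print(wer("the cat sat on mat", "the cat sit on"))
```

Lower is better: a WER of 0.4 means 40% of the reference words were transcribed incorrectly in some way.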
To ensure a comprehensive evaluation, we calculated the word error rate (WER) across 5 diverse datasets using 2,000 randomly selected audio samples.
For the following Hindi STT benchmark, we used WER as the metric across the selected datasets. NeuralSpace's model achieves the lowest WER (highest accuracy), outperforming seven vendors and performing 138% better than OpenAI and 77% better than Google.
Across the datasets, NeuralSpace consistently ranks among the top performers. Our Hindi STT model achieves an average WER of 18.05, showing its general effectiveness across multiple audio contexts.
Dataset Diversity: From open-source Common Voice (CV11) entries to technical MUCS lectures and Shrutilipi news bulletins, our datasets span a wide audio spectrum. This diversity ensures robust testing of speech-to-text models across various audio qualities and dialects.
Superiority on Shrutilipi: NeuralSpace's performance on the Shrutilipi dataset stands out with a WER of 10.47. This dataset, derived from All India Radio news bulletins, emphasizes the model's capability in understanding and transcribing formal Hindi speech, a crucial feature for professional applications.
Competitive Edge on MUCS: On the MUCS dataset, which comprises technical lectures, NeuralSpace scores a WER of 24.07. This is noteworthy since technical lectures often contain domain-specific terminologies, which can be challenging to transcribe.
Robustness on Gram Vaani: The Gram Vaani dataset, containing telephone-quality speech, presents unique challenges due to its audio quality and diverse regional accents. NeuralSpace's WER of 26.77 is commendable given the inherent difficulties of this dataset. Google, one of its major competitors, records a WER of 56.77 on the same dataset, a difference of 30 WER points in favor of NeuralSpace.
Universal Language Contribution (ULCA): On the ULCA dataset, which contains audios from varied sources like government TV and radio channels, NeuralSpace achieves a WER of 12.48. Azure, a close competitor, has a score of 27.75, establishing a difference of 15.27.
High-quality data insights start with accurate transcriptions. With the VoiceAI platform, you can generate audio insights to easily analyze post-call data to spot trends in your business, and track agent performance. With accurate STT combined with advanced analytics and translation capabilities, your business and customers get the best solution in the market.
*Even though WER is the most common metric for evaluating STT vendors, it can vary considerably depending on testing methodologies and test sets. Hence, WER should be interpreted in relative terms.
**Tests conducted in October 2023, comparing NeuralSpace VoiceAI's upgraded STT model against Google, Azure, AWS, OpenAI, Deepgram, Speechmatics, and SymblAI.