Arabic Speech-to-Text: Comparing Results of Top STT Providers

Felix Laumann

The results are in from our latest benchmarking of Arabic speech-to-text (STT) services. NeuralSpace emerged as the leader in accuracy, achieving an impressive 91% average accuracy across various dialects. We outperformed Intella, Speechmatics, OpenAI, Google, Azure, AWS and Symbl AI, and on the MASC dataset our accuracy was 59 percentage points higher than IBM’s.

The Value of Accurate STT

STT has a diverse range of applications, from transcribing calls and meetings to automated customer service systems, subtitle creation, speech analytics and more. However, for these use cases to be truly effective, STT accuracy is paramount.

Inaccurate transcriptions can result in misunderstandings and misinterpretations, leading to serious consequences, particularly in fields such as healthcare and legal proceedings. Additionally, inaccurate STT transcriptions can negatively impact user satisfaction, eroding trust and confidence in the product or service it powers, which ultimately hinders user adoption. As such, it’s critical to assess the performance of STT services in the language and dialect of the end users.

The Challenge of Dialects for STT

Developing an accurate STT system requires advanced algorithms and models. The process involves converting complex audio data into text, which requires the system to build a deep understanding of the nuances of the languages, accents, and dialects it supports. One of the major challenges for STT systems is dealing with regional dialects. STT models trained on standardized language data may struggle to accurately transcribe spoken language that deviates from the standard.

Although Modern Standard Arabic (MSA) is the formal written language used in most official contexts, it is not the language spoken in daily life by the majority of people living in the Arabic-speaking countries of the Middle East and Northern Africa (MENA). Instead, people speak various regional dialects that can differ significantly in pronunciation, grammar, and vocabulary. To address this challenge, STT models need to be trained on a vast amount of diverse language data that includes regional dialects to improve their accuracy and performance.

Additionally, the accuracy of an STT service depends heavily on the quality and clarity of the audio input, with performance decreasing in noisy environments or with low-quality recordings. Integrating linguistic knowledge of regional dialects and adapting acoustic models to specific dialects can help improve the accuracy of STT systems for non-standardized languages, helping them perform reliably in real-world situations.

To provide a comprehensive evaluation of our Arabic STT systems, we conducted accuracy tests that compared NeuralSpace’s transcriptions to those of eight other service providers, namely Intella, Speechmatics, OpenAI’s Whisper, Google, Azure, AWS, IBM, and Symbl AI. The testing was conducted on five publicly available datasets that included diverse voices of native Arabic speakers speaking a variety of dialects and regional accents.

We used the most common method of testing the accuracy of speech-to-text (STT) systems, which is the Word Error Rate (WER). This metric determines the percentage of words in the STT output that differ from the actual, 100% accurate, so-called “ground truth” transcription. The WER is calculated by dividing the total number of errors, which includes substitutions, deletions, and insertions, by the total number of words in the ground truth transcription.

💡 Ground truth is typically generated by human transcribers who listen to the audio and manually transcribe it into text.
A lower WER indicates higher accuracy of the STT system.
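
The WER calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmarking code used for this article: the function name word_error_rate is hypothetical, words are split on whitespace only, and the error count is taken as the word-level edit distance (the smallest number of substitutions, deletions, and insertions that turns the ground truth into the STT output).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent: (S + D + I) / N * 100.

    N is the number of words in the ground-truth (reference) transcription.
    Hypothetical helper for illustration; splits on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because the error count is divided by the number of ground-truth words, an output with many inserted words can produce a WER above 100%.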

However, WER is sensitive to variations in spelling, punctuation, and capitalization, which can lead to higher error rates even for correct transcriptions. To address this issue, we use a language-specific normalizer to standardize the text and make it less sensitive to such variations, resulting in a more accurate assessment of the STT system’s performance. This benchmark showcases a comparison of the accuracies of various STT service providers. The accuracies are simply calculated by subtracting the WER from 100.
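
To illustrate the kind of normalization meant here, the sketch below applies a few common Arabic text-normalization steps before scoring and then converts a WER into an accuracy. The specific rules (removing diacritics and tatweel, unifying alef variants, stripping punctuation) and the names normalize_arabic and accuracy_from_wer are assumptions chosen for illustration; the normalizer actually used in this benchmark is not described in this article and may differ.

```python
import re
import unicodedata

# Assumed rules, for illustration only.
# Short-vowel marks and other diacritics (U+064B to U+0652) plus tatweel (U+0640).
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, or hamza below, all mapped to bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text: str) -> str:
    """Reduce spelling, punctuation, and casing variation before computing WER."""
    text = _DIACRITICS_AND_TATWEEL.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    # Replace every Unicode punctuation character (including the Arabic comma
    # and question mark) with a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Lowercase any embedded Latin words and collapse repeated whitespace.
    return " ".join(text.lower().split())


def accuracy_from_wer(wer_percent: float) -> float:
    """Accuracy as defined in this benchmark: 100 minus the WER."""
    return 100.0 - wer_percent
```

In this sketch, both the ground truth and the STT output would be passed through normalize_arabic before the WER is computed, so that a transcription differing only in diacritics or punctuation is not counted as an error.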

Test Datasets

The benchmark was conducted using the following datasets:

Results

NeuralSpace achieved the highest accuracy among the Arabic speech-to-text (STT) providers on all evaluated datasets, with an average accuracy of 90.75% and a peak accuracy of 95%. Notably, on the MASC dataset, NeuralSpace’s accuracy was 59 percentage points higher than that of the lowest-performing system (IBM), illustrating a significant disparity in the performance of STT systems across providers.

The table below shows the accuracies of Intella, Speechmatics, OpenAI Whisper, Google, Azure, AWS, IBM, Symbl and NeuralSpace on all of the datasets we benchmarked against.

NeuralSpace has achieved exceptional performance in Arabic dialects by training our speech-to-text (STT) model on carefully sourced and curated data, drawing on the expertise of our team of linguists who are proficient in all dialects. The model is an encoder-decoder transformer-based system that can accurately transcribe speech recordings of varying lengths, even in the presence of background noise or music, multiple speakers, or heavy compression.

Conclusion

NeuralSpace’s emphasis on creating accurate Arabic language models has resulted in exceptional STT transcription performance, surpassing STT providers that lead the industry in other languages. With a continued commitment to developing advanced algorithms and models that capture the nuances of regional Arabic dialects, NeuralSpace aims to provide seamless, reliable AI-powered experiences for customers in the Arabic-speaking world.

Get in touch to learn more about NeuralSpace or visit our website. Head to the NeuralSpace VoiceAI Platform to try out our STT service, for free!
