Felix Laumann

Arabic Speech-to-Text: Comparing Results of Top STT Providers

We are thrilled to share our latest benchmarking results for Arabic speech-to-text (STT) services. NeuralSpace emerged as the leader in accuracy, achieving an average accuracy of 91% across various dialects and voices. Our performance surpassed that of eight other providers, including Google, AWS, Azure, Intella, OpenAI and Symbl AI, with an absolute accuracy lead of 59 percentage points over IBM on the MASC dataset. These results demonstrate our commitment to unlocking the power of NLP, regardless of where you are in the world or the language you speak.

The Value of Accurate STT

STT has a diverse range of applications from transcribing calls and meetings, to automated customer service systems, subtitle creation, speech analytics and more. However, for these use cases to be truly effective, STT accuracy is paramount.

Inaccurate transcriptions can result in misunderstandings and misinterpretations, leading to serious consequences, particularly in fields such as healthcare and legal proceedings. Additionally, inaccurate STT transcriptions can negatively impact user satisfaction, eroding trust and confidence in the product or service it powers, which ultimately hinders user adoption. As such, it’s critical to assess the performance of STT services in the language and dialect of the end users.

The Challenge of Dialects for STT

Developing an accurate STT system requires advanced algorithms and models. The process involves converting complex audio data into text, which requires the system to build a deep understanding of the nuances of the languages, accents, and dialects it supports.

One of the major challenges for STT systems is dealing with regional dialects. STT models trained on standardized language data may struggle to accurately transcribe spoken language that deviates from the standard.

Although Modern Standard Arabic (MSA) is the formal written language used in most official contexts, it is not the language spoken in daily life by the majority of people living in the Arabic-speaking countries of the Middle East and Northern Africa (MENA). Instead, people speak various regional dialects that can differ significantly in pronunciation, grammar, and vocabulary. To address this challenge, STT models need to be trained on a vast amount of diverse language data that includes regional dialects to improve their accuracy and performance.

Additionally, the accuracy of an STT service is heavily dependent on the quality and clarity of the audio input, with performance decreasing in noisy environments or with low-quality recordings. Integrating linguistic knowledge of regional dialects and adapting acoustic models to specific dialects can help improve the accuracy of STT systems for non-standardized languages, so that they perform reliably in real-world situations.

To provide a comprehensive evaluation of our Arabic STT systems, we conducted accuracy tests that compared NeuralSpace's transcriptions to those of eight other service providers, namely Intella, Speechmatics, OpenAI's Whisper, Google, Azure, AWS, IBM, and Symbl AI.

The testing was conducted on five publicly available datasets that included diverse voices of native Arabic speakers speaking a variety of dialects and regional accents.

We used the most common method of testing the accuracy of speech-to-text (STT) systems, which is the Word Error Rate (WER). This metric determines the percentage of words in the STT output that differ from the actual, 100% accurate, so-called “ground truth” transcription. The WER is calculated by dividing the total number of errors, which includes substitutions, deletions, and insertions, by the total number of words in the ground truth transcription.

💡 Ground truth is typically generated by human transcribers who listen to the audio and manually transcribe it into text.

A lower WER indicates higher accuracy of the STT system.
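
The WER definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the benchmark: the function name word_error_rate is hypothetical, words are split on whitespace, and the total number of substitutions, deletions, and insertions is taken to be the word-level edit (Levenshtein) distance between the ground truth and the STT output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent: (substitutions + deletions + insertions) / N * 100,
    where N is the number of words in the ground-truth (reference) transcription."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)  # cost 0 if words match
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```

Because insertions count as errors but do not add to the number of ground-truth words, the WER computed this way can exceed 100% when the output contains many extra words.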

However, WER is sensitive to variations in spelling, punctuation, and capitalization, which can lead to higher error rates even for correct transcriptions. To address this issue, we use a language-specific normalizer to standardize the text and make it less sensitive to such variations, resulting in a more accurate assessment of the STT system's performance.
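
To illustrate what such a normalizer might do, here is a minimal Python sketch. The rules below are assumptions chosen for illustration (they are common Arabic normalization steps), not the normalizer used in this benchmark, and the name normalize_arabic is hypothetical. The sketch strips short-vowel diacritics and the tatweel elongation character, maps the hamza and madda forms of alef to a bare alef, and removes punctuation.

```python
import re
import unicodedata

# Assumed, illustrative rules; the benchmark's actual normalizer is not specified.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")    # tashkeel marks and dagger alef
_TATWEEL = "\u0640"                                  # elongation character
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
_BARE_ALEF = "\u0627"


def normalize_arabic(text: str) -> str:
    """Reduce spelling and punctuation variation before computing WER."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALEF_VARIANTS.sub(_BARE_ALEF, text)
    # Replace every Unicode punctuation character (including the Arabic comma
    # and question mark) with a space so that adjacent words stay separated.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

In this sketch, the same function would be applied to both the ground truth and the STT output before the WER is computed, so that a transcription differing only in diacritics or punctuation is not counted as an error.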

This benchmark showcases a comparison of the accuracies of various STT service providers. The accuracies are simply calculated by subtracting the WER from 100.

Accuracy = 100 - WER
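
As a small sketch of this conversion in Python (the function name accuracy_from_wer is hypothetical, and the WER is assumed to be expressed in percent):

```python
def accuracy_from_wer(wer_percent: float) -> float:
    """Accuracy in percent, defined as 100 minus the WER in percent."""
    return 100.0 - wer_percent


# Illustrative value: a WER of 9.25% corresponds to an accuracy of 90.75%.
print(accuracy_from_wer(9.25))  # 90.75
```

Since the WER can exceed 100% when the output contains many inserted words, the accuracy defined this way is not bounded below by zero.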

Test Datasets

The benchmark was conducted using the following datasets:



CommonVoice 11 (link)

The most widely used STT dataset in academic and industrial benchmarks. It contains 147 hours of data with 1,309 voices from Arabic speakers across various countries.

Fleurs (link)

A dataset created by Google with 10 hours of audio and human-labeled transcription data. The speakers in the training set are different from the speakers in the test set.

MGB (link)

A broad and multi-genre dataset with approximately 1,200 hours of Arabic broadcast audio data, obtained from about 4,000 programmes broadcast on the Arabic TV channel Al Jazeera.

MASC (link)

A dataset that contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect.


NeuralSpace achieved the highest accuracy among the Arabic speech-to-text (STT) providers on all evaluated datasets, with an average accuracy of 90.75% and a peak accuracy of 95%.

Notably, on the MASC dataset, NeuralSpace achieved an accuracy 59 percentage points higher than that of the lowest-performing system (IBM), illustrating a significant disparity in the performance of STT systems across providers.

The table below shows the accuracies of Intella, Speechmatics, OpenAI Whisper, Google, Azure, AWS, IBM, Symbl and NeuralSpace on all of the datasets we benchmarked against.

NeuralSpace has achieved exceptional performance in Arabic dialects by training our speech-to-text (STT) model on carefully sourced and curated data, drawing on the expertise of our team of linguists who are proficient in all dialects. The model is an encoder-decoder transformer-based system that can accurately transcribe speech recordings of varying lengths, even in the presence of background noise or music, multiple speakers, or heavy audio compression.

In Conclusion

NeuralSpace's emphasis on creating accurate Arabic language models has resulted in exceptional STT transcription performance, surpassing STT providers that lead the industry in other languages. With a continued commitment to developing advanced algorithms and models that capture the nuances of regional Arabic dialects, NeuralSpace aims to provide seamless, reliable, AI-powered experiences for customers in the Arabic-speaking world.

Get in touch to learn more about NeuralSpace or visit our website. Head to the NeuralSpace Platform to try out our STT service, for free!
