In the rapidly evolving world of Speech-to-Text (STT) technology, making an informed choice can seem overwhelming. Yet, the success of your project hinges on this crucial decision. With so many claims about performance and accuracy, how can you navigate the maze of marketing hype to select the right vendor? The answer lies in objective benchmarking.
One of the key factors to consider when evaluating an STT model is the Word Error Rate (WER). WER is a metric used to determine the accuracy of transcriptions produced by an STT system. In this blog post, we will explore what WER is, why it is important, the nuances of calculating WER for different languages, and what is considered a good WER score.
Word Error Rate or WER is a metric used primarily in the field of speech recognition to measure the performance of an automatic speech recognition (ASR) system. WER calculates the minimum number of operations (substitutions, deletions, and insertions) required to change the system's transcription (prediction) into the reference transcription (truth), divided by the number of words in the reference.
WER ranges from 0 upward with no fixed upper bound: because insertions are counted, a prediction with many extra words can score above 1. The closer the WER is to 0, the better. WER is often expressed as a percentage, obtained by multiplying the value by 100. For example, a WER of 0.15 can also be written as 15%.
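The definition above can be written as WER = (S + D + I) / N, where S, D, and I are the numbers of substitutions, deletions, and insertions, and N is the number of words in the reference. Below is a minimal Python sketch of that calculation. It assumes words are separated by whitespace and applies no normalization; the function name `word_error_rate` is ours for illustration and does not come from any particular library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from prediction
                curr[j - 1] + 1,     # insertion: extra word in prediction
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the cat sat"))   # 0.0
print(word_error_rate("hi", "hi there my friend"))     # 3.0
```

The second call shows why WER has no upper bound: three inserted words against a one-word reference give a WER of 3.0, or 300%.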
WER is important because it provides:
Suppose provider X advertises a WER of 4.5% for its English model, and provider Y publishes 7.5% for its own. We know a lower WER indicates higher accuracy, so does that make provider X the better choice for you? Not necessarily.
Providers X and Y may have used completely different evaluation methods. They could have evaluated on different test sets (which vary in recording quality, noise, accents, etc.) or normalized the transcripts differently. WER is a sensitive metric, and these factors can dramatically affect the results.
This is why you should evaluate all providers on the same test set, one that is representative of your use case, and only then compare the results.
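A like-for-like comparison can be sketched as follows. Everything here is illustrative: the provider names, the toy transcripts, and the helper names `edit_distance` and `corpus_wer` are assumptions, not real vendor output or a library API. The sketch pools errors over the whole test set (total edits divided by total reference words), which is one common way to report a single WER per provider.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Minimum number of word substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total edits over total reference words across one shared test set."""
    if len(references) != len(hypotheses):
        raise ValueError("every reference needs exactly one hypothesis")
    total_edits = sum(
        edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    total_words = sum(len(r.split()) for r in references)
    return total_edits / total_words


# Hypothetical test set and provider outputs, for illustration only.
references = ["turn on the lights", "what is the weather today"]
provider_outputs = {
    "provider_x": ["turn on the light", "what is the weather today"],
    "provider_y": ["turn on lights", "what is weather today"],
}

scores = {
    name: corpus_wer(references, outputs)
    for name, outputs in provider_outputs.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.1%}")
# provider_x: 11.1%
# provider_y: 22.2%
```

Because both providers are scored against the same references with the same procedure, the difference between the two numbers reflects the models rather than the evaluation setup.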
Text normalization is the process of transforming text into a consistent and standardized form. Normalization is a crucial step before calculating WER. It helps ensure that different variations or representations of the same content are treated as equivalent, thereby improving the accuracy and efficiency of text analysis. But it’s not an easy process and can be very nuanced for different languages.
Normalization for English usually involves:
Normalization for other languages like Arabic can become even more challenging owing to the rich morphology, script, and phonetic variations in it. Some additional steps to normalize the script involve:
The need for normalization can be best understood using an example:
Without normalization, WER is 50% because 2 out of 4 words in the predicted sentence need to be substituted to arrive at the truth. On the other hand, with normalization, WER is 0% as the ground truth and the prediction are exactly the same after normalization.
Thus, normalization in the above example helped us accurately measure how well the model was able to convert speech to text, without being affected by how well the text was punctuated. This example is also a clear indicator of how sensitive WER can be to minor features of a transcript.
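The same effect can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the sentence pair is hypothetical, chosen to match the 50% and 0% figures above, and `normalize` applies only three simple English steps (lowercasing, removing ASCII punctuation, collapsing whitespace). A production normalizer would need more care; for instance, this one also strips apostrophes, turning "don't" into "dont".

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical pair: the words are right, only case and punctuation differ.
truth = "Hello, how are you?"
prediction = "hello how are you"

print(wer(truth, prediction))                        # 0.5
print(wer(normalize(truth), normalize(prediction)))  # 0.0
```

Without normalization, "Hello," and "you?" each count as a substitution, so 2 of the 4 reference words are wrong. After normalization the two strings are identical.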
Human evaluation is also an important step in choosing the best STT provider. A model can perform well while its WER figures suggest it is subpar, for the following reasons:
Hence, it is a good idea to get predictions verified by a human who can speak and read the language.
Head here to try out our interactive WER calculator.
The target WER score often depends on the needs of a particular industry. Generally, a lower WER signifies better performance. A WER of 0% represents a perfect transcription, which is rare in practice. Typically, a WER below 10% is seen as excellent, while scores between 10% and 20% are considered good.
But this generalization should not necessarily be your guiding star. WER, as we have seen, can vary a lot depending on testing methodologies and test sets. Hence, WER should be looked at in a relative way. Using the same testing strategies to compare results among providers helps make a more informed choice than just considering absolute scores. It's also essential to align WER standards with the specific demands and industry norms of your application.
WER plays a vital role in evaluating the accuracy and reliability of an STT vendor. By understanding WER and its nuances for different languages, along with determining the appropriate range of WER scores based on the specific context, you can make an informed decision on which STT vendor aligns best with your unique requirements and expectations.
Head to https://voice.neuralspace.ai/login to try VoiceAI for free.