Tenzin Singhay Bhotia & Aditya Dalmia
Evaluating OpenAI's Whisper
What is Whisper?
It was big news when OpenAI open-sourced Whisper, a multilingual automatic speech recognition (ASR) model trained on 680,000 hours of annotated speech data, of which 117,000 hours are non-English speech spanning 96 languages.
“We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition” - OpenAI
The engineering team at NeuralSpace tested Whisper, and we can confirm its superb performance on English. It is astonishingly robust to different accents and handled just about everything we tried: audio with background noise, fast speech, low-quality files, and more. Since it is a multilingual model, we even used it to generate transcripts for videos in Spanish and German, and it did a good job there too.
Take Andrej Karpathy’s word for it: he used Whisper to create transcripts for all 322 Lex Fridman podcast episodes, and they are very accurate.
Whisper: the one-size-fits-all model?
We know how amazing Whisper is, but that raises the question: is it a one-stop solution for every kind of speech recognition task out there, regardless of language? Where does the model actually start to fail? Which business use cases will it not serve accurately?
We found a couple of areas where it does fail: transcribing difficult non-English words, such as the names of dishes on a menu, and, most importantly, transcribing languages other than a few high-resource ones like Spanish, German, Italian and English.
At NeuralSpace we are currently automating the drive-thru experience for a fast-food company using ASR, Language Understanding (intent + entity recognition) and Text-to-Speech. Their clients’ menus include some fancy dishes that are not the easiest to pronounce. Have a look:
We obtained a few audio samples of people placing orders that mixed Greek, Italian and Indian dishes from the menu. However, Whisper was only able to identify 50% of the dishes (10 out of the 20 shown above) in our sample data.
Check out some of the transcriptions below:
Bear in mind that these results come despite Whisper having been trained on Greek, Italian and Hindi data too! Whisper’s training data distribution is, of course, heavily skewed towards English, but we had hoped a multilingual ASR model would do a bit better. The same problem applies to proper nouns such as product names and people’s names, where the model often predicts a slight variation of the word.
Improvise, Adapt, Overcome
Recent research in vocabulary adaptation, i.e. tailoring a base ASR model like Whisper to a specific set of vocabulary, has shown promising results for issues like those mentioned above. It involves adapting an existing base model to new words or phrases it has never seen before, so that it can recognize them too!
At NeuralSpace, we are building a product around ASR where anyone can adapt our base models to their specific use case. Most adaptation solutions we have seen rely on fine-tuning with audio recordings and their corresponding transcripts for the new data. This approach often works very well, but many companies that want to use speech technology do not have any speech data at all. Taking this into account, we let our users adapt our ASR models with just a text corpus. The text file can contain isolated words or phrases (for example, menu items) or complete sentences from a particular domain (for example, law).
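Our actual adaptation method is not described here, but to give a feel for what text-only vocabulary biasing can do, here is a deliberately simple illustration: post-correcting a transcript against a custom word list with fuzzy string matching. The menu items and misspellings below are made up for the example; this is a sketch of the idea, not our production approach.

```python
import difflib

# Hypothetical custom vocabulary, e.g. menu items the base model keeps misspelling
MENU_VOCAB = ["spanakopita", "tiramisu", "gulab jamun", "bruschetta"]

def correct_with_vocab(transcript: str, vocab: list[str], cutoff: float = 0.75) -> str:
    """Replace each transcript word with its closest vocabulary entry, if close enough."""
    single_words = [v for v in vocab if " " not in v]  # multi-word items need phrase matching
    corrected = []
    for word in transcript.lower().split():
        match = difflib.get_close_matches(word, single_words, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

print(correct_with_vocab("one spanakoppita and a teramisu please", MENU_VOCAB))
# → "one spanakopita and a tiramisu please"
```

Real vocabulary adaptation works inside the model (for example by biasing the decoder) rather than editing its output, but the goal is the same: pull near-miss predictions towards the customer's known vocabulary.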
For the above task of recognizing restaurant orders, we adapted our base English model on a text corpus of just the 20 dishes from the menu above. The entire process took less than five minutes, and the adapted model transcribed 17 out of 20 dishes correctly. Here are some results on the same sample audios:
For background: our base model has been trained on only 3,000 hours of LibriSpeech data, all in English, whereas Whisper has been trained on 680,000 hours of data in multiple languages. This demonstrates the value of adaptation: it lets even a much smaller model accurately predict the specific words and phrases a customer requires.
The other area where we found Whisper falling short was the transcription of low-resource languages. When we transcribed speech in Indian languages on real-world data from one of our customers, we unfortunately got a much higher Word Error Rate (WER) than with the dedicated per-language models we have built at NeuralSpace. Whisper is called a multilingual model because it has been trained on a total of 680,000 hours of data, of which 117,000 hours are in languages other than English. Even so, it was not great at transcribing most Indian languages, including Punjabi, Malayalam, Tamil and Gujarati.
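As a refresher, WER is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the model's hypothesis, divided by the number of words in the reference. A minimal sketch, with a made-up one-substitution example:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit-distance table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in a four-word reference → WER = 0.25
print(word_error_rate("main ek dosa lunga", "main ek dosa dunga"))  # → 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a rate rather than an accuracy.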
Whisper has been trained predominantly on English data, and even among the other languages, the number of hours for some is extremely low. Check out page 27 of this PDF for an exact breakdown.
OpenAI has mentioned that all the data used to train Whisper was already available on the internet and was scraped algorithmically. The inherent bias of internet data is that some languages, such as English, Spanish and Japanese, are naturally far more common and have much more data than others, such as Odia, Punjabi and Yoruba. We always like to point to the Wikipedia article that lists the percentage of internet content in each language. This sheds light on the inherent problem with Whisper’s low-resource language recognition, and why it has a fairly high WER for languages like Punjabi even in OpenAI’s own tests. Check out more WER results across languages here.
Is low-resource coverage important? How to make ASR models work in low-resource languages?
The answer is obvious to us at NeuralSpace, but for everyone else: just think about the 4 billion people who speak a low-resource language as their mother tongue. Of course it is important! Around 50% of the world’s population communicates in languages in this category, and to cater to them you need ASR models built for those languages. Multilingual speech recognition, especially in low-resource languages, is a challenging problem with few pre-existing solutions. There are multiple issues: some low-resource languages are simply too rare on the internet, or the data exists but is not well organised (for example, audio exists but commonly accepted transcripts do not).
At NeuralSpace, low-resource language support is a major focus: we take these languages seriously and prioritise building solutions specifically for them. We have built services like Language Understanding (NLU) and Entity Recognition (NER) that are available in more than 100 languages and perform very well across the board. Check our recent benchmarking results here. Our next goal is to do the same for ASR systems as well; stay tuned.
The short answer to what is most needed to build accurate ASR models for low-resource languages is, obviously, more data. We have dedicated efforts underway to make speech-to-text (STT) viable for low-resource languages by acquiring data through multiple channels: algorithmic scraping from the internet, in-house collection, collection with our many partners (mostly language service providers), human-in-the-loop review, and third-party data vendors. On top of that, we are building model architectures specifically tailored to low-resource languages and continuously improving our models, so that our customers can achieve very low WERs on real-world data and their use cases.
What does NeuralSpace have to offer?
NeuralSpace offers:
adaptation capabilities for ASR: make models recognize custom vocabulary
extensive language support: transcribe for low-resource languages
no-code platform: generate and edit transcriptions, add timestamps, identify speakers, and adapt vocabulary without writing any code (new features coming soon)
scalable infrastructure which offers any preferred deployment method: SaaS, private-cloud/on-premise, and soon on-device as well
On top of that, we are investing heavily in research to build robust deep learning models using state-of-the-art techniques, as well as in dedicated data collection for low-resource languages.
We conclude with a to-the-point quote from OpenAI’s CEO Sam Altman on the future of AI models and companies:
“... I think there will be a small handful of fundamental models out there that other people build on … what will happen is there'll be a whole new set of startups that take an existing very large model of the future and tune it, which is not just fine-tuning … there will be a lot of access provided to create a model for medicine or using a computer … those companies will create a lot of value …” - Sam Altman
Now it’s time to head to the NeuralSpace platform to try out our services.
Check out our Documentation to read more about the NeuralSpace Platform and its different services.
Feel free to reach out to us or book a call directly for any discussion.