| You're doing awesome (arduous) work. The text normalization in particular is a total bear. I feel your pain. Limiting your test to one file is good in many ways because it scopes down the amount of work needed to do a comparison (it's also a big systematic risk, but hey, there are only so many hours in the day). Your previous blog post helps in understanding how much work needs to go into comparing speech services. It's super common to undervalue just how much processing a human is doing innately while listening to audio: hearing words, feeling out ideas, resolving ambiguities, etc. So it's awesome to see deep work going into it (besides the speech teams working on these problems at places like Google, Baidu, Microsoft, and Deepgram [btw, I'm a founder of Deepgram]). I wouldn't be so quick to attribute the differences in WER to how 'modern' the system is. It's more about the areas they play in: what audio type they care about, what training datasets they use, what post-processing they do, and what language models they choose to apply. (Speed/turnaround time gives you a much better indication of how modern a system is.) Many speech transcription systems focus on a specific type of audio as their target market. There are ~4 main types: phone (customer support/sales), broadcast (news/podcasts/videos), command and control (Siri/Google Assistant), and ambient (meetings/lectures/security). Google's video model is perfect for what you are doing (broadcast/podcast, 2 dudes talking into probably pretty good mics). In other cases the results will be very different (if you compared phone calls, for example). The differences won't be just in accuracy, but also in speed (throughput and latency), price, and reliability. It's awesome to see an in-depth comparison being discussed broadly. Speech interfacing and understanding is just getting started. We're still at the tip of the Intelligence Revolution and there's still a long way to go. 
The scale of compute and data is huge, even to bring just one language up to snuff. Aside: It's a dirty little secret that there actually aren't 20 different speech recognition companies in the world using 20 different systems. There are only a handful (many use Google and tweak the outputs). They are mostly doing one of five things: using old and aged tech, using old but well-oiled tech (like Google; this takes a ton of manpower and no other company spends the money to do it), using an open source spinoff (like Kaldi or Mozilla), building their own from scratch (like Deepgram), or reselling someone else's. If you care about the current state of things, this is a reasonably good finger in the wind as of Sept. 2018: Use Google if you are doing command and control or broadcast audio. Do not use Google if you are doing meetings or phone calls or you need a reliable system (it's unreliable at scale). DO use Google in all cases (even phone/meeting) for audio that is in a language other than English (no other company is even close). Use Google to prototype systems and to teach yourself how to use a speech recognition API and what results to expect as a baseline. Do not use Google if you need scale and speed and reliability and affordability. Do not use Google if you need to use your own vocabulary or if your audio has repetitive things being said in it that have accents or jargon (like call centers). In that case, use a company that can do a true custom acoustic model and vocabulary for that (like Deepgram). There are only a few companies that will consider doing this (and Google is not one of them). Expect that many more things are going to be addressed. Think of it like: what can a human do? 
A human can jump into a conversation and quickly tell you: there are 3 people, speaking about rebuilding a feature in the main code; two are male, one is female; male1 and female1 do most of the talking in the beginning, then it's the two dudes at the end; it sounds like a recording of a meeting they are having; they never came to a definitive conclusion or next steps; they spent 80 minutes in the meeting. All of that (and I'm sure more) will be done by machine in the future. |