by pruthvishetty 1122 days ago
This looks huge. Anyone know how this compares with Whisper in terms of quality and speed?

According to their blog post[1], MMS achieves roughly half the word error rate of Whisper while supporting 11x more languages. Pretty impressive.

[1] https://ai.facebook.com/blog/multilingual-model-speech-recog...
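
For anyone unsure what "error rate on words" means here: word error rate (WER) is usually the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (real benchmarks typically normalize case and punctuation first); the function name is made up for illustration and is not from the MMS or Whisper code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from output)
                curr[j - 1] + 1,     # insertion (extra word in output)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the output contains many inserted words, so "half the error rate" is a ratio of two such numbers, not a percentage-point difference.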

I wonder what the performance is on English specifically.

Edit: Just checked the paper, it seems to be worse[1][2] but feel free to correct me.

I feel like they should've just taken the Whisper architecture, scaled it, and scaled the dataset as they did.

[1] Page: https://i.imgur.com/bq15Tno.png

[2] Paper: https://scontent.fcai19-5.fna.fbcdn.net/v/t39.8562-6/3488279...

It's worse on English and a lot of other common languages (see Appendix C of the paper). It does better on less common languages like Latvian or Tajik, though.
Which implies Whisper just hasn't focused on those languages? It seems disingenuous to claim that the error rate has halved when it's worse on the apex language.
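
To illustrate how both statements can be true at once, here is a rough sketch of an unweighted average over languages. The numbers are entirely made up for illustration (they are NOT from the paper or the blog post), and `model_a` / `model_b` are placeholder names, not Whisper or MMS results:

```python
from statistics import mean

# Hypothetical per-language WERs in percent. Invented numbers, for illustration only.
model_a = {"english": 4.0, "spanish": 5.0, "latvian": 40.0, "tajik": 60.0}
model_b = {"english": 6.0, "spanish": 7.0, "latvian": 15.0, "tajik": 20.0}

# Unweighted (macro) average: every language counts equally,
# regardless of how many speakers or test utterances it has.
avg_a = mean(model_a.values())
avg_b = mean(model_b.values())

# Languages where model_b is worse than model_a.
worse_for_b = [lang for lang in model_a if model_b[lang] > model_a[lang]]

print(f"average WER: a={avg_a:.2f}, b={avg_b:.2f}, ratio={avg_b / avg_a:.2f}")
print("model_b is worse on:", worse_for_b)
```

With these invented numbers the averaged error rate drops by more than half even though the second model is worse on the two high-resource languages, so a headline average says little about any single language.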
My guess is that wav2vec performs better than Whisper on low-resource languages.
lack of labels on graph axes should be a crime