Hacker News
by rhizome 2743 days ago
The built-in offline Android speech recognizer is really bad. Giants like Google and Facebook are blessed with data, so they can train state-of-the-art speech recognition models (much, much better than what you get out of the built-in Android recognizer) and then provide speech recognition as a service. They can control the recognition because it happens on their servers and is independent of Android or any other OS.

Is the implication that offline Android recognition does not train on the owner's voice at all? I imagine a lot of phones these days are at least as powerful as the Pentium 200s used to train (successfully!) Dragon Dictate et al. 20+ years ago.

2 comments

Well, let's forget the offline Android recognizer. That's the one that's built in and doesn't go to Google over the internet to get a more accurate transcription. It's fairly good for what it is, but it doesn't come close to the accuracy you get from Google's recognition servers via their APIs. That's because the models offered through the recognition services are much larger and more robust than what you get straight out of Android. The services offered by companies such as Google do not adapt the acoustic model to individual speakers, which is why they are called speaker-independent.

Secondly, when I say "train", I mean something totally different from how you seem to be using the term. You are using it to mean adapting an acoustic model to an individual speaker to improve performance. I am talking about building the initial model. Typical RNN or even convolution-based models require a lot of time and processing power to train. What's even harder to get than the processing power, though, is of course the data to train on.

I think you're making a distinction without a difference, since there is (or has been for a long time) an initial model supplied by personal recognition devices/software/etc., too. And sure, if it's not trainable by the user, it's going to be using a generalized model. There are tradeoffs there, and the point of my comment was that for a personal speech recognizer it makes sense that it be trainable by the user, especially when the hardware is powerful enough.

This is false. The distinction I was making is very real. The "initial models" you are talking about are small and weak, nowhere near as robust or powerful as the models trained and used by Google/Microsoft/etc. on their own servers. State-of-the-art neural-network-based recognizers need serious hardware for training, orders of magnitude more than what is available in smartphones or personal commodity hardware (unless it's massively clustered and distributed).

Secondly, the trained model itself is very big just to store, and inference against it is also resource-intensive. This is why Android, Google Maps, search, etc. go out to Google's backend recognition servers for speech-to-text before falling back on the shitty (but relatively good) offline built-in model (which may not even be using state-of-the-art techniques and may be an old-school GMM-based recognizer).
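
For anyone who wants the server-first, offline-fallback flow spelled out, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: `RecognizerUnavailable`, `transcribe`, and the two stub recognizers are made-up names, not Android or Google APIs, and I am only assuming the real clients behave roughly like this.

```python
from typing import Callable, Tuple

# A recognizer is modeled as any function from raw audio bytes to text.
Recognizer = Callable[[bytes], str]


class RecognizerUnavailable(Exception):
    """Hypothetical error: the recognizer cannot serve the request (e.g. no network)."""


def transcribe(audio: bytes, server: Recognizer, offline: Recognizer) -> Tuple[str, str]:
    """Try the large server-side model first; use the small on-device model on failure.

    Returns the transcript plus a label saying which recognizer produced it.
    """
    try:
        return server(audio), "server"
    except RecognizerUnavailable:
        return offline(audio), "offline"


def unreachable_server(audio: bytes) -> str:
    # Hypothetical stub that simulates the phone having no connectivity.
    raise RecognizerUnavailable("no network")


def small_offline_model(audio: bytes) -> str:
    # Hypothetical stub standing in for the built-in on-device recognizer.
    return "turn on the light"


if __name__ == "__main__":
    print(transcribe(b"\x00\x01", unreachable_server, small_offline_model))
```

The point of structuring it this way is that the caller never needs to know which model answered; the label is only returned so the app can, for example, show lower confidence for offline results.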

Finally, the large models trained on the backend servers using their distributed computing infrastructure are far more accurate than the shitty fallback model, so speaker-dependent adaptations aren't necessary. If you can get very, very good performance from a speaker-independent model, why would you put in the extra effort to make speaker-dependent adaptations when the gain is marginal? Not to mention that speaker-independent models are useful in more situations. Google, for instance, can caption videos automatically using speech recognition, which is amazing. If the models were speaker-dependent, they wouldn't be able to do that. That's why the focus has shifted so much toward speaker-independent models.

> The built-in offline Android speech recognizer is really bad.

I totally disagree. Compared to Sphinx it is still light-years better.

To wit, I use it for voice recognition in my Android-based home automation, and even from a distance with background noise it still works with better than 90% accuracy. My original tests with Sphinx in a similar environment got about 30%.
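
The comment does not say how those accuracy figures were measured. One common way to put a number on it is word error rate (WER), with word accuracy taken as 1 - WER. The sketch below is only an assumption about how one might score a test like this, not the method actually used above; `word_error_rate` is a hypothetical helper and the phrases are made-up examples.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up home automation command and a made-up misrecognition of it.
    wer = word_error_rate("turn on the kitchen light", "turn on kitchen lights")
    print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
```

Note that WER can exceed 1.0 when the recognizer inserts many extra words, so "accuracy" computed this way is not strictly a percentage of words correct.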

I completely agree with you that it's much better than Sphinx/Pocketsphinx. It's even much better than Microsoft's built-in speech recognizer that's been around since XP. But it is still very bad compared to the recognition available via Google's voice API, and that was the point. Also, I was trying to explain that, given the types of models used today for recognition, models trained and hosted on servers are inherently going to be bigger and more accurate than ones deployed in the field.