by computerex 2743 days ago
Well, let's forget the offline Android recognizer. That's the one that's built in and doesn't go to Google over the internet to get a more accurate transcription. It's fairly good for what it is, but it doesn't come close to the accuracy you get from Google's recognition servers via their APIs. That's because the models offered through the recognition services are much larger, more robust, and better than what you get straight out of Android. The services offered by companies such as Google do not adapt the acoustic model to individual speakers and are therefore called speaker independent.

Secondly, when I say "train", I mean it in a totally different context from how you seem to be using the term. You are using it to mean adapting an acoustic model to an individual speaker to improve performance. I am talking about building the initial model. Typical RNN or even convolution-based models require a lot of time and processing power to train. What's even harder to get than the processing power, though, is of course the data to train on.


I think you're making a distinction without a difference, since there is (or has been for a long time) an initial model supplied by personal recognition devices/software/etc., too. And sure, if it's not trainable by the user it's going to be using a generalized model. There are tradeoffs there, and the point of my comment was that for a personal speech recognizer it makes sense that it be trainable by the user, especially when the hardware is powerful enough.

This is false. The distinction I was making is very real. The "initial models" you are talking about are small and weak, nowhere near as robust or powerful as the models trained and used by Google/Microsoft/etc. on their own servers. State-of-the-art neural-network-based recognizers need serious hardware for training, orders of magnitude more than what is available in smartphones or personal commodity hardware (unless it's massively clustered and distributed).

Secondly, the trained model itself is very big just to store, and inference against the model is also resource intensive. This is why Android, Google Maps, search, etc. go out to Google's backend recognition servers for speech to text before falling back on the shitty (but relatively good) offline built-in model (which may not even be using state-of-the-art speech recognition techniques and may be using old-school GMM-based recognizers).
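
To make the server-first, fallback-second flow concrete, here's a rough sketch. Everything in it is hypothetical: `transcribe`, `RecognizerUnavailable`, and the fake recognizers are names I made up for illustration, not Android or Google APIs, and I'm only assuming the real clients behave roughly like this.

```python
from typing import Callable, Tuple


class RecognizerUnavailable(Exception):
    """Hypothetical error: the recognizer could not be reached or could not run."""


def transcribe(
    audio: bytes,
    server: Callable[[bytes], str],
    on_device: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Try the large server-side model first, then the small built-in one.

    Returns the transcript and the name of the recognizer that produced it.
    """
    try:
        return server(audio), "server"
    except RecognizerUnavailable:
        # No network (or the backend is down): use the weaker offline model.
        return on_device(audio), "on_device"


# Stand-in recognizers so the sketch runs without a network or a real model.
def fake_server(audio: bytes) -> str:
    return "recognize speech"


def fake_server_offline(audio: bytes) -> str:
    raise RecognizerUnavailable("no network")


def fake_on_device(audio: bytes) -> str:
    return "wreck a nice beach"


online = transcribe(b"", fake_server, fake_on_device)
offline = transcribe(b"", fake_server_offline, fake_on_device)
```

The point of the ordering is that the caller always prefers the big backend model and only pays the accuracy penalty of the built-in one when the server can't be reached.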

Finally, the large models trained on the backend servers using their distributed computing infrastructure are far more accurate than the shitty fallback model, so speaker-dependent adaptations aren't necessary. If you can get very good performance from a speaker-independent model, why would you put in the extra effort to make speaker-dependent adaptations if the gain is marginal? Not to mention that speaker-independent models are useful in more situations and are extremely powerful. Google, for instance, can caption videos automatically using speech recognition, which is amazing. If the models were speaker dependent, they wouldn't be able to do that. That's why the focus has been so much on speaker-independent models.