
This is false. The distinction I was making is very real. The "initial models" you are talking about are small and weak, nowhere near as robust or powerful as the models trained and used by Google/Microsoft/etc. on their own servers. State-of-the-art neural-network-based recognizers need serious hardware for training, orders of magnitude more than what is available in smartphones or personal commodity hardware (unless it's massively clustered and distributed).

Secondly, the trained model itself is very big just to store, and inference against it is also resource intensive. This is why Android, Google Maps, search, etc. go out to Google's backend recognition servers for speech-to-text before falling back on the shitty (but relatively good) offline, on-device model (which may not even be using state-of-the-art speech recognition techniques and may be using old-school GMM-based recognizers).
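
To make the "server first, offline fallback" pattern concrete, here is a minimal sketch in Python. It is not how Android or Google actually implement it; `recognize`, `unreachable_server`, and `tiny_offline_model` are hypothetical names made up for illustration, and the recognizers are stubs rather than real models.

```python
from typing import Callable, Tuple

def recognize(audio: bytes,
              server: Callable[[bytes], str],
              offline: Callable[[bytes], str]) -> Tuple[str, str]:
    """Try the big backend recognizer first; fall back to the on-device model.

    Returns the transcript and which recognizer produced it.
    """
    try:
        return server(audio), "server"
    except (ConnectionError, TimeoutError):
        # Backend unreachable: use the small, less accurate local model.
        return offline(audio), "offline"

# Hypothetical stand-ins for the two recognizers.
def unreachable_server(audio: bytes) -> str:
    raise ConnectionError("no network")

def tiny_offline_model(audio: bytes) -> str:
    return "rough transcript"

print(recognize(b"\x00\x01", unreachable_server, tiny_offline_model))
# prints ('rough transcript', 'offline')
```

The point is only the ordering: the expensive, accurate model lives behind the network call, and the local model is consulted only when that call fails.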

Finally, the large models trained on the backend servers using their distributed computing infrastructure are far more accurate than the shitty fallback model, so speaker-dependent adaptations aren't necessary. If you can get very good performance from a speaker-independent model, why would you put in the extra effort to make speaker-dependent adaptations if the gain is very marginal? Not to mention that speaker-independent models are useful in more situations and are extremely powerful. Google, for instance, can caption videos automatically using speech recognition, which is amazing. If the models were speaker dependent, they wouldn't be able to do that. That's why the focus has been so much on speaker-independent models.