by rhizome
2743 days ago
I think you're making a distinction without a difference, since personal recognition devices/software/etc. also ship with an initial model, and have for a long time. And sure, if it's not trainable by the user, it's going to be using a generalized model. There are tradeoffs there, and the point of my comment was that it makes sense for a personal speech recognizer to be trainable by the user, especially when the hardware is powerful enough.
|
Secondly, the trained model itself is very big just to store, and inference against it is also resource intensive. This is why Android/Google Maps/search/etc. go out to Google's backend recognition servers for speech-to-text before falling back on the shitty (but relatively good) offline on-device model (which may not even be using state-of-the-art speech recognition techniques and may be using old-school GMM-based recognizers).
Finally, the large models trained on the backend servers using their distributed computing infrastructure are far more accurate than the shitty fallback model, so speaker-dependent adaptations aren't necessary. If you can get very, very good performance from a speaker-independent model, why would you put in the extra effort to make speaker-dependent adaptations when the gain is marginal? Not to mention that speaker-independent models are useful in more situations and are extremely powerful. Google, for instance, can caption videos automatically using speech recognition, which is amazing. If the models were speaker dependent, they wouldn't be able to do that. That's why the focus has been so much on speaker-independent models.
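
For anyone who wants the "server first, offline fallback" flow spelled out, here is a minimal sketch in Python. It is not how Android or Google actually implement it; `transcribe`, `remote`, and `offline` are hypothetical names made up for illustration, and the two recognizers are just callables standing in for a backend request and an on-device model.

```python
from typing import Callable, Tuple

# Hypothetical type: anything that turns raw audio bytes into text.
Recognizer = Callable[[bytes], str]


def transcribe(audio: bytes, remote: Recognizer, offline: Recognizer) -> Tuple[str, str]:
    """Try the large server-side model first, then fall back to the on-device one.

    Returns the transcript plus a label saying which recognizer produced it.
    """
    try:
        return remote(audio), "remote"
    except (ConnectionError, TimeoutError):
        # Assumption: network trouble surfaces as one of these built-in exceptions.
        return offline(audio), "offline"


# Stand-in recognizers, purely for demonstration.
def unreachable_backend(audio: bytes) -> str:
    raise ConnectionError("no network")


def working_backend(audio: bytes) -> str:
    return "transcript from the big model"


def small_offline_model(audio: bytes) -> str:
    return "transcript from the small model"


online_result = transcribe(b"\x00\x01", working_backend, small_offline_model)
offline_result = transcribe(b"\x00\x01", unreachable_backend, small_offline_model)
```

The caller only sees a transcript and a source label, so the quality gap between the two models is invisible to the rest of the app except through accuracy.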