The built-in offline Android speech recognizer is really bad. Giants like Google and Facebook are blessed with data, so they can train state-of-the-art speech recognition models (much, much better than what you get out of the built-in Android recognizer) and then provide speech recognition as a service. They can control the recognition because it happens on their servers and is independent of Android or any other OS.
And so FB, for instance, can send voice data to their servers and get text back. FB can then run text sentiment analysis to get further context about the message.
Sadly, most people don't have the speech data to train their own large-vocabulary recognizers, and it's even harder for languages other than English. With the exception of Google/Amazon/FB/Microsoft/Baidu/etc., everyone else has to use the APIs offered by those companies to do high-fidelity recognition. That sucks because every recognition has a cost: you have to pay someone else to do it.
Whereas FB/Amazon/MS/Baidu/etc. can do high-fidelity, large-vocabulary recognition on their own servers and offer it as a service. THIS is why FB wants to build speech recognition systems.
Labeled data is indeed a problem. The only sizable corpus I know of is TIMIT; it costs $300 and, I think, prohibits commercial use. That said, phonetic labeling is becoming less important thanks to designs like this...
I wonder if you could bootstrap a sizable speech dataset by trawling audio off YouTube and then using one of the really good cloud speech recognition services to label it. :)
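A rough sketch of what that bootstrapping loop might look like, in Python. Everything here is hypothetical: `transcribe` stands in for whatever cloud recognizer you would call (no real API is assumed), and the confidence threshold is an arbitrary illustration, not a recommended value. The idea is just to keep the machine-generated transcripts the recognizer itself is confident about and discard the rest.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class LabeledClip:
    path: str
    transcript: str
    confidence: float


def pseudo_label(
    clip_paths: Iterable[str],
    transcribe: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.9,
) -> List[LabeledClip]:
    """Label audio clips with a recognizer and keep only confident results.

    `transcribe` is a hypothetical callable that takes a clip path and
    returns (transcript, confidence in [0, 1]). It is a placeholder for a
    cloud speech service, not a real client library.
    """
    labeled = []
    for path in clip_paths:
        text, confidence = transcribe(path)
        text = text.strip()
        # Drop empty transcripts and anything the recognizer was unsure of,
        # since those labels would mostly add noise to the training set.
        if text and confidence >= min_confidence:
            labeled.append(LabeledClip(path, text, confidence))
    return labeled
```

Note that a dataset built this way inherits the labeling service's mistakes, so a model trained on it is unlikely to beat the service that produced the labels.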
IMHO, the TIMIT corpus should no longer be used in most application-driven speech recognition research, as it's small and completely unrealistic for any real-world application. Furthermore, nobody cares about phone error rates, as recognizing phones is not the ultimate goal.
There have been much better, larger datasets available for a long time. For example, the Fisher English conversational telephone speech corpus was released in 2004 and contains ~1950h of transcribed speech. There are tons of other datasets in various languages and for various applications (conversational speech, broadcast transcription, etc.).
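For anyone unfamiliar with the metrics: phone error rate and word error rate are the same calculation (edit distance divided by reference length) applied to different tokens. A minimal Python sketch, with function names of my own choosing:

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (token missing from hyp)
                cur[j - 1] + 1,          # insertion (extra token in hyp)
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length.

    Pass word lists to get word error rate, phone lists to get phone
    error rate.
    """
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted word out of four reference words.
wer = error_rate("turn on the light".split(), "turn on a light".split())
print(wer)  # 0.25
```

Word error rate is the one that maps onto what a user of a dictation or command system actually experiences, which is why application-driven work reports it.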
The labeled data is $300? That's basically free, even for somebody who's just a serious hobbyist, much less any funded public or private research group.
Edit: It's even less [1]:
$0.00 1993 Member
$250.00 Non-Member
$125.00 Reduced-License
> The built-in offline Android speech recognizer is really bad. Giants like Google and Facebook are blessed with data, so they can train state-of-the-art speech recognition models (much, much better than what you get out of the built-in Android recognizer) and then provide speech recognition as a service. They can control the recognition because it happens on their servers and is independent of Android or any other OS.
Is the implication that offline Android recognition does not train on the owner's voice at all? I imagine a lot of phones these days are at least as powerful as the Pentium 200s used to train (successfully!) Dragon Dictate et al 20+ years ago.
Well, let's set aside the offline Android recognizer. That's the built-in one that doesn't go out to Google over the internet for a more accurate transcription. It's fairly good for what it is, but it doesn't come close to the accuracy you get from Google's recognition servers via their APIs. That's because the models behind the recognition services are much larger, more robust, and better than what you get straight out of Android. The services offered by companies such as Google do not adapt the acoustic model to individual speakers, which is why they are called speaker-independent.
Secondly, when I say "train", I mean something totally different from how you seem to be using the term. You are using it in the sense of adapting an acoustic model to an individual speaker to improve performance. I am talking about building the initial model. Typical RNN-based or even convolution-based models require a lot of time and processing power to train. What's even harder to get than the processing power, though, is of course the data to train on.
I think you're making a distinction without a difference, since there is (or has been for a long time) an initial model supplied by personal recognition devices/software/etc., too. And sure, if it's not trainable by the user it's going to be using a generalized model. There are tradeoffs there, and the point of my comment was that for a personal speech recognizer it makes sense that it be trainable by the user, especially when the hardware is powerful enough.
This is false. The distinction I was making is very real. The "initial models" you are talking about are small and weak, nowhere near as robust or powerful as the models trained and used by Google/Microsoft/etc. on their own servers. State-of-the-art neural-network-based recognizers need serious hardware for training, orders of magnitude more than what is available in smartphones or personal commodity hardware (unless it's massively clustered and distributed).
Secondly, the trained model itself is very big just to store, and inference against it is also resource intensive. This is why Android/Google Maps/search/etc. go out to Google's backend recognition servers for speech-to-text before falling back on the shitty (but relatively good) offline model, which may not even be using state-of-the-art speech recognition techniques and may be an old-school GMM-based recognizer.
Finally, the large models trained on the backend servers using their distributed computing infrastructure are far more accurate than the shitty fallback model, so speaker-dependent adaptation isn't necessary. If you can get very good performance from a speaker-independent model, why would you put in the extra effort to make speaker-dependent adaptations when the gain is marginal? Not to mention that speaker-independent models are useful in more situations. Google, for instance, can caption videos automatically using speech recognition, which is amazing. If the models were speaker-dependent, they wouldn't be able to do that. That's why the focus has shifted so much towards speaker-independent models.
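To make the server-first, offline-fallback pattern concrete, here is a minimal Python sketch. It is only an illustration of the control flow described above, not how Android actually implements it; `cloud` and `offline` are hypothetical callables standing in for the backend service and the on-device model.

```python
from typing import Callable, Tuple

Recognizer = Callable[[bytes], str]


def recognize(audio: bytes, cloud: Recognizer, offline: Recognizer) -> Tuple[str, str]:
    """Try the large server-side model first, then the small local one.

    Returns (transcript, source) so the caller knows which model answered.
    Both recognizers are placeholders, not real APIs.
    """
    try:
        return cloud(audio), "cloud"
    except (ConnectionError, TimeoutError):
        # No network or the backend is too slow: accept the less accurate
        # on-device model rather than failing outright.
        return offline(audio), "offline"
```

The caller trades accuracy for availability only when the network path fails, which matches the behavior described in this comment.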
> The built-in offline Android speech recognizer is really bad.
I totally disagree. Compared to Sphinx it is still light-years better.
To wit, I use it for voice recognition in my Android-based home automation, and even from a distance with background noise it still gets >90% accuracy. My original tests with Sphinx in a similar environment got about 30%.
I completely agree with you that it's much better than Sphinx/Pocketsphinx. It's even much better than Microsoft's built-in speech recognizer that's been around since XP. But it is still very bad compared to the recognition available via Google's voice API, and that was the point. Also, I was trying to explain that, given the types of models used for recognition today, models trained and hosted on a server are inherently going to be bigger and more accurate than ones deployed in the field.