by capo64 3005 days ago
No mention of DNN-based ASR like DeepSpeech? There are even open-source Python implementations available from Mozilla and Paddle.

These models are way easier to train, have surprisingly good accuracy, and are robust to noise.

4 comments

Seems all the methods in the writeup are APIs (not sure about wit or sphinx), so what's missing is locally-run processes like DeepSpeech. But on that same note, I'd like to see more accuracy comparisons across all these methods, and pricing (Google gets to around $1.44 / recorded hour?), since that's a significant factor.

From prior use, Google's speech API (at least the "video" model) is freakishly accurate compared to DeepSpeech, to the point where I wondered if they used closed captioning to help train their model. But I haven't seen the rest of these at work: https://i.imgur.com/cdOlARO.png
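For anyone wanting to sanity-check the "$1.44 / recorded hour" figure, here is a rough sketch of the arithmetic. The per-increment rate and the 15-second billing increment are assumptions picked to reproduce that number, not confirmed pricing, and `hourly_cost` is just an illustrative helper name.

```python
from decimal import Decimal


def hourly_cost(rate_per_increment: Decimal, increment_seconds: int) -> Decimal:
    """Convert a per-increment billing rate into a cost per recorded hour."""
    increments_per_hour = Decimal(3600) / Decimal(increment_seconds)
    return rate_per_increment * increments_per_hour


# Assumed values: $0.006 billed per 15-second increment (hypothetical).
ASSUMED_RATE = Decimal("0.006")
ASSUMED_INCREMENT_SECONDS = 15

print(hourly_cost(ASSUMED_RATE, ASSUMED_INCREMENT_SECONDS))
```

Swapping in another provider's rate and billing increment gives a like-for-like hourly number, which is what a pricing comparison would need.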

afaik, pure DNN models still lag seriously behind 'traditional' HMM-based frameworks augmented by neural networks (using DNNs for specific parts of the pipeline). Last I checked a couple of months ago, state of the art for HMM+DNN was something like 6% word error rate (WER). The best Seq2Seq DNN I know of hit 18% WER, dropping to 10% when a secondary language model was integrated. (My guess is that part of the problem is leaning too heavily on the attention mechanism... a more 'streaming friendly' framework should help reduce the load on the attention mechanism.)

https://arxiv.org/pdf/1610.03022.pdf
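Since the numbers above are all WER, here is a minimal sketch of how it is usually computed: word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference transcript. This assumes plain whitespace tokenization with no text normalization, which real benchmarks typically do apply, and `word_error_rate` is just an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, so it is an error rate relative to the reference, not a bounded accuracy score.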

This has changed recently: full seq2seq is now matching hybrid models [0].

[0] https://arxiv.org/abs/1712.01769

Oh, thanks! Now I know what I'm reading on the commute tomorrow.

The majority of the APIs mentioned are probably using DNNs. But those are all online-only, so higher-quality offline engines would definitely be an improvement. I wonder how much effort it would require to integrate them into the SpeechRecognition package.

+1