Seems all the methods in the writeup are APIs (not sure about wit or sphinx), so what's missing is locally-run engines like DeepSpeech. On that same note, I'd like to see accuracy comparisons across all these methods, plus pricing (Google comes to around $1.44 / recorded hour?), since that's a significant factor.
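For anyone who wants to sanity-check the $1.44 / hour figure, here's a rough sketch of the arithmetic. The $0.006 per 15-second block rate is an assumption back-derived from that hourly number, not a quoted price, and `transcription_cost` is just a made-up helper name; swap in whatever rate and billing increment a given provider actually charges.

```python
import math


def transcription_cost(duration_seconds, rate_per_block, block_seconds=15):
    """Dollar cost of one recording, assuming billing rounds up to whole blocks."""
    blocks = math.ceil(duration_seconds / block_seconds)
    return blocks * rate_per_block


# Assumed rate: $0.006 per 15-second block (back-derived from ~$1.44/hour).
hourly = transcription_cost(3600, 0.006)
print(f"${hourly:.2f} per recorded hour")
```

Note that if short clips get rounded up to a full block, lots of tiny recordings will cost more per hour of audio than one long file.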
From prior use, Google's speech API (at least the "video" model) is freakishly accurate compared to DeepSpeech, to the point where I wondered if they used closed captioning to help train their model. But I haven't seen the rest of these at work: https://i.imgur.com/cdOlARO.png
afaik, pure DNN models still lag seriously behind 'traditional' HMM-based frameworks augmented by neural networks (using DNNs for specific parts of the pipeline). Last I checked a couple months ago, state of the art for HMM+DNN was something like 6% word error rate (WER). The best Seq2Seq DNN I know of hit 18% WER, dropping to 10% when a secondary language model was integrated. (My guess is that part of the problem is leaning too heavily on the attention mechanism... a more 'streaming friendly' framework should help reduce the load on it.)
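In case WER is unfamiliar: it's usually computed as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (real benchmarks normalize case, punctuation, etc.); `word_error_rate` is just an illustrative name:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

So 6% WER means roughly 6 word-level mistakes per 100 reference words. Since insertions count as errors too, WER can exceed 100% on bad output.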
The majority of the APIs mentioned are probably using DNNs. But those are all online-only, so higher-quality offline engines would definitely be an improvement. I wonder how much effort it would require to integrate them into the SpeechRecognition package.