I've had good results with https://github.com/flashlight/flashlight/blob/master/flashli.... It seems to work well with spoken English in a variety of accents. The biggest limitation is that the architecture they have pretrained models for doesn't work well with clips longer than ~15 seconds, so you have to segment your input files.
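For anyone wondering what that segmentation step could look like, here is a minimal sketch using only Python's standard `wave` module. It is not part of flashlight; the names `segment_bounds` and `split_wav` are made up for illustration, and it assumes uncompressed PCM WAV input with the ~15 second limit mentioned above as the default.

```python
import wave


def segment_bounds(total_frames, frame_rate, max_seconds=15.0):
    """Return (start, end) frame index pairs, each at most max_seconds long."""
    step = int(frame_rate * max_seconds)
    if step <= 0:
        raise ValueError("max_seconds is too small for this frame rate")
    return [(s, min(s + step, total_frames)) for s in range(0, total_frames, step)]


def split_wav(path, out_prefix, max_seconds=15.0):
    """Write consecutive fixed-length chunks of a PCM WAV file.

    Hypothetical helper: output files are named <out_prefix>_000.wav, etc.
    Returns the list of paths written.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        bounds = segment_bounds(src.getnframes(), src.getframerate(), max_seconds)
        for i, (start, end) in enumerate(bounds):
            out_path = f"{out_prefix}_{i:03d}.wav"
            # Frames are read sequentially, so each read continues where the last ended.
            frames = src.readframes(end - start)
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # frame count in the header is corrected on close
                dst.writeframes(frames)
            out_paths.append(out_path)
    return out_paths
```

Fixed-length cuts can land in the middle of a word, so in practice cutting at silences (or overlapping the chunks slightly) is probably a better idea. This only shows the basic mechanics.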
I created edgedict [0] a year ago as one of my side projects. At the time it was the only open source STT with streaming capabilities. If anyone is interested, pretrained weights for English and Chinese are available.
I have used VOSK a bit recently. The out-of-the-box experience was great compared to earlier projects (looking at you, Kaldi and Sphinx...). Word-level audio segmentation was one use case: https://stackoverflow.com/a/65370463/1967571
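A rough sketch of how word-level timings can be pulled out of VOSK, in case it helps. This is not the code from the linked answer. It assumes the `vosk` Python package is installed, that a model has been downloaded to a local directory, and that the input is a mono 16-bit PCM WAV file. The helper names `words_from_result` and `transcribe_words` are mine.

```python
import json


def words_from_result(result_json):
    """Extract (word, start, end) tuples from one Vosk result JSON string.

    Results without recognized words have no "result" key, so default to [].
    """
    data = json.loads(result_json)
    return [(w["word"], w["start"], w["end"]) for w in data.get("result", [])]


def transcribe_words(wav_path, model_dir):
    """Run Vosk over a WAV file and collect per-word timings in seconds.

    Assumption: model_dir points at an unpacked Vosk model directory.
    """
    import wave
    from vosk import Model, KaldiRecognizer  # third-party: pip install vosk

    words = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        rec.SetWords(True)  # ask the recognizer for per-word timestamps
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance has been finalized.
            if rec.AcceptWaveform(data):
                words.extend(words_from_result(rec.Result()))
        words.extend(words_from_result(rec.FinalResult()))
    return words
```

The start and end times can then be used to cut the audio at word boundaries with whatever audio tool you prefer.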
Thank you. I deeply appreciate you mentioning our efforts. We have put quite a lot of time and expertise into building accurate speech recognition. It is not easy to get as many mentions as Mozilla, so we are thankful for every single one!
Mozilla Deep Speech is an open source speech recognition engine, based upon Baidu's Deep Speech research paper[2].
Unsurprisingly, Deep Speech requires a corpus such as... Common Voice.
[1] https://github.com/mozilla/DeepSpeech
[2] https://arxiv.org/abs/1412.5567