by jalopy 1774 days ago
Going along with this: What are the latest and greatest open source speech-to-text models and/or tools out there?

Would love to hear from experienced practitioners and a bit of detail on the experience.

Thanks HN community!

8 comments

Mozilla announced Deep Speech[1] around the same time as Common Voice.

Mozilla Deep Speech is an open source speech recognition engine, based upon Baidu's Deep Speech research paper[2].

Unsurprisingly, Deep Speech requires a corpus such as... Common Voice.

[1] https://github.com/mozilla/DeepSpeech

[2] https://arxiv.org/abs/1412.5567

They killed this after the Nvidia grant.
Ah, damn. Didn't realise.

It also looks like Baidu are now developing their Deep Speech as open source? https://github.com/PaddlePaddle/DeepSpeech

I've had good results with https://github.com/flashlight/flashlight/blob/master/flashli.... Seems to work well with spoken English in a variety of accents. The biggest limitation is that the architecture they have pretrained models for doesn't really work well with clips longer than ~15 seconds, so you have to segment your input files.
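For anyone wondering what that segmenting step might look like, here is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed PCM WAV input; the function name `split_wav`, the `chunk` prefix, and the 15 second default are my own placeholders, not anything from flashlight.

```python
import wave


def split_wav(path, max_seconds=15.0, prefix="chunk"):
    """Split a PCM WAV file into consecutive clips of at most max_seconds each.

    Hypothetical helper: cuts at fixed intervals, so a cut can land mid-word.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        # Number of audio frames that fit in one clip.
        frames_per_chunk = int(src.getframerate() * max_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{prefix}_{index:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width, and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Fixed-length cuts are the simplest option but can split a word in half. Cutting at pauses instead (for example with a voice activity detector) would likely give cleaner transcripts, at the cost of more code.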
I created edgedict [0] a year ago as one of my side projects. At that time it was the only open source STT with streaming capabilities. If anyone is interested, pretrained weights for English and Chinese are available.

[0] https://github.com/theblackcat102/edgedict

Kaldi and DeepSpeech both support streaming, right?
Have used VOSK a bit recently. The out-of-the-box experience was great compared to earlier projects (looking at you Kaldi and Sphinx...). Word-level audio segmentation was one use case: https://stackoverflow.com/a/65370463/1967571
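A rough sketch of the word-level part, for the curious. It assumes the recognizer was asked for word timings and returned a JSON string with a `result` list of entries carrying `word`, `start`, and `end` fields; that shape is my assumption about Vosk's output, the `word_segments` helper is hypothetical, and the sample values are made up for illustration.

```python
import json


def word_segments(result_json):
    """Return (word, start_sec, end_sec) tuples from a Vosk-style result string."""
    result = json.loads(result_json)
    # A result without word timings has no "result" key, so default to empty.
    return [(w["word"], w["start"], w["end"]) for w in result.get("result", [])]


# Illustrative input only, not real recognizer output.
sample = (
    '{"result": ['
    '{"conf": 1.0, "start": 0.84, "end": 1.11, "word": "hello"}, '
    '{"conf": 0.97, "start": 1.20, "end": 1.65, "word": "world"}], '
    '"text": "hello world"}'
)

for word, start, end in word_segments(sample):
    print(f"{start:6.2f} {end:6.2f} {word}")
```

With the start and end times in hand, cutting the audio per word is just slicing frames at `start * sample_rate` and `end * sample_rate`.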
Vosk is built on Kaldi.
Kdenlive supports automatic subtitles created with VOSK now btw. This makes it a lot more accessible for non-tech folks.
Vosk is my favourite. I have used DeepSpeech too. Vosk works better.
Thank you. I deeply appreciate you mentioning our efforts. We put quite a lot of time and knowledge into building accurate speech recognition. It is not that easy to get as many mentions as Mozilla, so we are thankful for every single one!
Vosk just works well, and it works on mobile platforms too. One suggestion is to put the license on the alphacephei site. The GitHub repo has it, but the site doesn't.
Same question for text-to-speech!