HN Mirror


by jalopy 1822 days ago

Going along with this: What are the latest and greatest open source speech-to-text models and/or tools out there?

Would love to hear from experienced practitioners and a bit of detail on the experience.

Thanks HN community!

8 comments

orra 1822 days ago

Mozilla announced Deep Speech[1] around the same time as Common Voice.

Mozilla Deep Speech is an open source speech recognition engine, based upon Baidu's Deep Speech research paper[2].

Unsurprisingly, Deep Speech requires a corpus such as... Common Voice.

[1] https://github.com/mozilla/DeepSpeech

[2] https://arxiv.org/abs/1412.5567


rasz 1822 days ago

They killed this after the Nvidia grant.


orra 1822 days ago

Ah, damn. Didn't realise.

It also looks like Baidu are now developing their Deep Speech as open source? https://github.com/PaddlePaddle/DeepSpeech


kcorbitt 1822 days ago

I've had good results with https://github.com/flashlight/flashlight/blob/master/flashli.... It seems to work well with spoken English in a variety of accents. The biggest limitation is that the architecture they have pretrained models for doesn't really work well with clips longer than ~15 seconds, so you have to segment your input files.
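For anyone wondering what the segmenting step might look like, here is a minimal sketch that cuts a PCM WAV file into consecutive clips of at most 15 seconds using only the Python standard library. The function name `split_wav` and the output naming scheme are made up for illustration and are not part of flashlight. It also assumes uncompressed WAV input.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, max_seconds=15.0):
    """Split a PCM WAV file into consecutive clips of at most max_seconds each.

    Hypothetical helper for illustration; returns the list of clip paths.
    """
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    clip_paths = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        # Number of audio frames that fit into one clip (at least one).
        frames_per_clip = max(1, int(params.framerate * max_seconds))
        index = 0
        while True:
            frames = reader.readframes(frames_per_clip)
            if not frames:
                break
            clip_path = out_dir / f"{src.stem}_{index:04d}.wav"
            with wave.open(str(clip_path), "wb") as writer:
                # Copy channels, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                writer.setparams(params)
                writer.writeframes(frames)
            clip_paths.append(clip_path)
            index += 1
    return clip_paths
```

Fixed-length cuts like this can split a word in half at a clip boundary, so in practice cutting at silences (or overlapping the clips slightly) would probably give better transcripts.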


blackcat201 1822 days ago

I created edgedict [0] a year ago as one of my side projects. At that time it was the only open source STT with streaming capabilities. If anyone is interested, the pretrained weights for English and Chinese are available.

[0] https://github.com/theblackcat102/edgedict


lazyresearcher 1821 days ago

Kaldi and DeepSpeech both support streaming, right?


mazoza 1822 days ago

https://github.com/coqui-ai/STT


jononor 1822 days ago

Have used VOSK a bit recently. The out-of-the-box experience was great compared to earlier projects (looking at you, Kaldi and Sphinx...). Word-level audio segmentation was one use case: https://stackoverflow.com/a/65370463/1967571
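As a rough idea of how word-level timing can be pulled out of Vosk, here is a small sketch. It is not taken from the linked answer. It assumes the `vosk` Python package is installed, that a model has been downloaded to a local directory, and that the input is mono 16-bit PCM WAV. The helper names `words_from_vosk_results` and `transcribe_words` are made up for illustration, and the JSON field names (`result`, `word`, `start`, `end`) reflect my understanding of what the recognizer emits when word output is enabled.

```python
import json
import wave


def words_from_vosk_results(result_jsons):
    """Collect (word, start_sec, end_sec) tuples from Vosk result JSON strings."""
    words = []
    for raw in result_jsons:
        # Results without recognized words have no "result" list.
        for item in json.loads(raw).get("result", []):
            words.append((item["word"], item["start"], item["end"]))
    return words


def transcribe_words(wav_path, model_path):
    """Run Vosk over a WAV file and return word-level timings (hypothetical helper)."""
    # Imported lazily so the parsing helper above works without vosk installed.
    from vosk import KaldiRecognizer, Model

    results = []
    with wave.open(wav_path, "rb") as wf:
        recognizer = KaldiRecognizer(Model(model_path), wf.getframerate())
        recognizer.SetWords(True)  # ask for per-word start/end times
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance is complete.
            if recognizer.AcceptWaveform(data):
                results.append(recognizer.Result())
        results.append(recognizer.FinalResult())
    return words_from_vosk_results(results)
```

Something like `transcribe_words("speech.wav", "model")` would then give a list of timed words that can be used to cut the audio at word boundaries; both arguments are placeholder paths.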


woodson 1822 days ago

Vosk is built on Kaldi.


stegrot 1822 days ago

Kdenlive now supports automatic subtitles created with VOSK, btw. This makes it a lot more accessible for non-tech folks.


zerop 1822 days ago

Vosk is my favourite. I have used Deep Speech too. Vosk works better.


nshm 1822 days ago

Thank you. I deeply appreciate you mentioning our efforts. We have spent quite some time and knowledge building accurate speech recognition. It's not that easy to get as many mentions as Mozilla, so we are thankful for every single one!


zerop 1821 days ago

Vosk just works well, and it works on mobile platforms too. One suggestion is to put the license on the alphacephei site. The GitHub repo has it, but the site doesn't.


woodson 1822 days ago

NVidia NeMo: https://github.com/NVIDIA/NeMo


thom 1822 days ago

Same question for text-to-speech!
