HN Mirror

by HackedBunny 2168 days ago

My quick test of DeepSpeech, as posted on Facebook back in April:

I just ran a speech-to-text converter on a very clear clip of former Doctor Who actor Tom Baker talking in an interview.

The DeepSpeech converter uses the very latest AI deep-learning advancements to 'listen' to the audio and output the spoken words as text.

After 3 long minutes of running it on a 30-second clip, it printed out its interpretation:

"hooloomooloo how booboorowie i have a honeymoon"

4 comments

twoslide 2168 days ago

It's supposed to work on sentence-length audio (4-5 seconds); they suggest chunking your audio first: https://discourse.mozilla.org/t/longer-audio-files-with-deep...
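
For anyone wanting to try the chunking idea, here is a minimal sketch using only Python's standard `wave` module. It cuts at fixed 5-second boundaries, which is an assumption made for simplicity: a fixed cut can land mid-word, so splitting on silence would likely work better. The helper names `split_wav` and `make_silent_wav` are hypothetical and not part of DeepSpeech.

```python
import io
import wave
from typing import List


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV of silence, used here only as demo input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()


def split_wav(wav_bytes: bytes, chunk_seconds: float = 5.0) -> List[bytes]:
    """Split WAV data into consecutive fixed-length WAV chunks."""
    chunks = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Copy the source format so each chunk is a valid WAV file.
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Each returned chunk is a complete WAV file in memory, so it can be written to disk or handed to whatever transcriber you are using, one chunk at a time.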

magicalhippo 2168 days ago

Also, there are essentially two parts to this: the neural net is used for speech-to-characters, and then a language model is used to convert the character stream to words.

I found that the language model they supplied was trained on data that did not contain the words I needed, and got significantly improved results when making my own language model using the kenlm[1] tools.

[1]: https://kheafield.com/code/kenlm/
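
To make the vocabulary point concrete, here is a toy sketch of a language model choosing between two candidate transcripts. It is not KenLM and not how DeepSpeech's decoder works: it is a hand-rolled unigram model with add-one smoothing, and the corpora, candidates, and function names are all made up for illustration.

```python
import math
from collections import Counter


def train_unigram(corpus: str):
    """Count word frequencies in a whitespace-tokenised corpus."""
    counts = Counter(corpus.lower().split())
    return counts, sum(counts.values())


def log_prob(sentence: str, counts: Counter, total: int) -> float:
    """Add-one smoothed unigram log-probability of a sentence."""
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen words
    return sum(
        math.log((counts[word] + 1) / (total + vocab))
        for word in sentence.lower().split()
    )


def pick_best(candidates, counts: Counter, total: int) -> str:
    """Return the candidate transcript the model finds most probable."""
    return max(candidates, key=lambda s: log_prob(s, counts, total))


# Two hypothetical acoustic-model outputs that sound alike.
candidates = ["doctor who has a tardis", "doctor who has a target"]

# A generic corpus that never mentions "tardis".
generic = train_unigram("the archer has a target and the doctor who has a clinic")
# An in-domain corpus that does.
domain = train_unigram("doctor who has a tardis and the tardis is blue")

print(pick_best(candidates, *generic))
print(pick_best(candidates, *domain))
```

The generic model prefers "target" because it has never seen "tardis", while the in-domain model picks the right word. That is the same effect, in miniature, as retraining the language model on text that contains the words you need.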

zaptrem 2166 days ago

Would it be possible to substitute this for GPT2/BERT? Or is that a different type of language model? Can the pre-trained language model be fine-tuned? I’m using DeepSpeech to transcribe long-form lecture audio, and have just assumed there would be a massive improvement once they noise-harden the models with 1.0.

nshm 2161 days ago

GPT2 is not a good language model, but there are things like XLM. Mozilla DeepSpeech doesn't support XLM rescoring; other toolkits do, and it gives a great improvement in accuracy. If you care about accurate transcription, you'd better consider alternatives.

zaptrem 2158 days ago

I didn't know any other ML-based open source transcription engines existed? I can't seem to find them on Google.

donw 2168 days ago

I didn’t know Tom Baker was Welsh.

fxtentacle 2168 days ago

When I tried it out with English and German phone recordings, it was working competitively. I would have ranked it better than Amazon but worse than Google.

Did you maybe not convert your WAV to the correct sampling rate?
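
A quick way to rule that out is to inspect the WAV header before transcribing. The sketch below uses Python's standard `wave` module and assumes the target format is 16 kHz, mono, 16-bit PCM, which is my understanding of what the released English models expect; check the documentation for the model you actually use. The function names are hypothetical.

```python
import io
import wave
from typing import List


def make_wav(rate: int, channels: int) -> bytes:
    """Build a short silent 16-bit WAV, used here only as demo input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00" * (2 * channels * (rate // 10)))  # 0.1 s
    return buf.getvalue()


def wav_format_problems(wav_bytes: bytes, expected_rate: int = 16000) -> List[str]:
    """List mismatches against the assumed format; empty means it looks fine."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        if w.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {w.getframerate()} Hz, expected {expected_rate} Hz"
            )
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
        if w.getsampwidth() != 2:
            problems.append(f"{8 * w.getsampwidth()}-bit samples, expected 16-bit")
    return problems


# A typical CD-style stereo file fails two of the three checks.
for problem in wav_format_problems(make_wav(44100, 2)):
    print(problem)
```

If the list is non-empty, resample or downmix the file with your audio tool of choice before drawing conclusions about transcription quality.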

bmn__ 2168 days ago

An experiment is worthless for drawing conclusions if it's not reproducible by other people.

Besides the hypothesis that DS sucks, the software could also very well be just fine and you made methodological errors.
