


	by dandiep 330 days ago
	Whisper is unusable IMO because of the hallucinations, which are widely documented. Removing silence from audio clips helps, but even then it will auto-correct grammar, translate bilingual speech, etc. This has improved in the latest audio models but is not solved [1].

	[1] https://news.ycombinator.com/item?id=43427376


ilyakaminsky 330 days ago

I wouldn't describe it as "unusable" so much as needing to understand its constraints and how to work around them. I built a business on top of Whisper [1] and one of the early key insights was to implement a good voice activity detection (VAD) model in order to reduce Whisper's hallucinations on silence.

[1] https://speechischeap.com
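
To illustrate the idea of gating Whisper with VAD, here is a minimal sketch of an energy-based detector. It is not the commenter's pipeline and not a trained VAD model: it assumes mono float samples in [-1, 1], and the function names, the 30 ms frame size, and the 0.01 RMS threshold are all hypothetical choices for illustration.

```python
import math


def rms(frame):
    """Root-mean-square energy of a list of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(samples, sample_rate=16000, frame_ms=30,
                    threshold=0.01, min_speech_ms=90):
    """Return (start_sec, end_sec) spans whose frame energy reaches the threshold.

    threshold and min_speech_ms are illustrative values, not tuned ones.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_frames = max(1, min_speech_ms // frame_ms)
    n_frames = math.ceil(len(samples) / frame_len)
    segments = []
    start = None  # index of the first frame of the current voiced run

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced = rms(frame) >= threshold
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            # Close the run; keep it only if it is long enough to be speech.
            if i - start >= min_frames:
                segments.append((start * frame_len / sample_rate,
                                 i * frame_len / sample_rate))
            start = None

    # Close a run that lasts until the end of the audio.
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len / sample_rate,
                         len(samples) / sample_rate))
    return segments
```

Only the returned spans would then be sent to the transcriber, so the model never sees long stretches of silence. A trained VAD model would likely be far more robust to background noise than a fixed energy threshold.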


poly2it 330 days ago

How does this make a profit? Whisper should be $0.006 to $0.010 per minute, but you charge less than $0.001? Do you speed up the audio 10x?


ilyakaminsky 330 days ago

Thanks for noticing. It took a lot of effort to optimize the pipeline every step of the way: VAD, inference server, hardware optimization, etc. But nothing that would compromise quality. The audio is currently transcribed at its original speed. I'll be sure to publish something if I manage to speed it up without any loss in WER.


eric-burel 330 days ago

That's the problem with raw large models: they should always be coupled with small satellite models and logic. It's (probably) easier to detect hallucinations using a traditional ML/DL model that can catch mismatches (it's easy to build a synthetic dataset for this) than it is to transcribe. And the simplest piece of code can detect silence and check that it matches no text.
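
A rough sketch of that last check, under stated assumptions: the transcript is a list of dicts with "start" and "end" times in seconds plus a "text" field, the audio is mono float samples in [-1, 1], and the function name and the 0.01 RMS threshold are made up for illustration. It flags any segment that carries text while the matching audio span is near-silent.

```python
import math


def rms(samples):
    """Root-mean-square energy of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def flag_silent_segments(samples, segments, sample_rate=16000, threshold=0.01):
    """Return transcript segments that contain text over near-silent audio.

    The segment shape and the threshold are assumptions for this sketch.
    """
    flagged = []
    for seg in segments:
        # Convert the segment's time span to sample indices, clamped to the clip.
        lo = max(0, int(seg["start"] * sample_rate))
        hi = min(len(samples), int(seg["end"] * sample_rate))
        has_text = bool(seg["text"].strip())
        if has_text and rms(samples[lo:hi]) < threshold:
            flagged.append(seg)
    return flagged
```

Flagged segments are candidates for dropping or re-checking. This only catches text emitted over silence, not the grammar correction or translation issues mentioned at the top of the thread.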


horseradish7k 330 days ago

well, auto correcting grammar happens in normal subtitles too... "Why don't subtitles match dubbing?" by Tom Scott: https://youtu.be/pU9sHwNKc2c
