Hacker News
by darkpicnic 1291 days ago
Does anyone know if this new model handles silence better? I was trying to use Whisper to transcribe bursts of talking amid long spans of silence, but the frequency of hallucinations was too high.
2 comments

I suspect a simple solution is to remove the silence as a preprocessing step in the pipeline.
In large-scale tests, I observed hallucinations from Whisper in speech regions of audio.
Sure, but that should be considered an accuracy problem. Telling a system to do its best to extract words from background sounds, and then getting words from it, is a different type of problem.

-------

I can't reply to the below, but you have to consider the difference in the signal to noise ratio for why it should be considered a different problem.

If I told a binary image classifier to classify a clear image of a cat as either a "cat" or a "dog", and it said "dog", then that would be an accuracy problem.

If I gave the same classifier an image of a black cat standing in a very dark room, where even a human would have trouble identifying it, and it says "dog" it's not an accuracy problem as much as a signal to noise ratio problem.

It seems like you're assuming that all of the issues you describe have the same root cause. I don't think that's a sound assumption... tehe.

It's still important for future use to not have invalid results. This is a workaround for now.
You don't need ML to trim out silence
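For what it's worth, here is a minimal sketch of the non-ML approach: compute the RMS level of short frames and keep only the ones above a fixed dBFS threshold. The function names, the 30 ms frame size, and the -40 dB default are assumptions for illustration, not anything Whisper ships with, and samples are assumed to be floats in [-1, 1].

```python
import math

def frame_dbfs(frame):
    """RMS level of one frame in dB relative to full scale (samples in [-1, 1])."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")

def non_silent_segments(samples, sample_rate, frame_ms=30, threshold_db=-40.0):
    """Return (start, end) sample index pairs for runs of frames above threshold_db."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    segments = []
    start = None
    for i in range(0, len(samples), frame_len):
        loud = frame_dbfs(samples[i:i + frame_len]) > threshold_db
        if loud and start is None:
            start = i                      # a loud run begins here
        elif not loud and start is not None:
            segments.append((start, i))    # the loud run ended at this frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

def drop_silence(samples, sample_rate, **kwargs):
    """Concatenate only the non-silent segments (hypothetical helper)."""
    out = []
    for start, end in non_silent_segments(samples, sample_rate, **kwargs):
        out.extend(samples[start:end])
    return out
```

Keeping the segment boundaries around (rather than only the trimmed audio) lets you map transcript timestamps back to the original recording. A fixed threshold like this only works when the background really is quiet.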
Silence is often problem-dependent... You may want ML to differentiate between noisy audio with speech and noisy audio without speech.
"Silence" is a problematic term. For me, that word encompasses: squeaky chairs, typing on a loud keyboard, moving objects around on my table, etc. In a perfect world, Whisper —like a human— can easily distinguish a human voice from the din of my office, and only try and transcribe my voice.

Does anyone have solutions for clearing out "silence" from an audio file that work off something a bit more accurate than just "<= decibel x"?

Edited for grammar.
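One small step up from a fixed "<= decibel x" cutoff, sketched here as an assumption rather than a recommendation from this thread: estimate the noise floor from the recording itself (a low percentile of the frame levels) and flag frames that rise a margin above it. The names `active_frames`, `margin_db`, and `floor_percentile` are made up for illustration. This adapts to steady background noise, but it will not tell a squeaky chair from a voice; that likely needs a trained voice activity detector.

```python
import math

def frame_levels_db(samples, frame_len):
    """RMS level in dB for each consecutive frame of frame_len samples."""
    levels = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        levels.append(20.0 * math.log10(max(rms, 1e-10)))  # clamp to avoid log(0)
    return levels

def active_frames(samples, frame_len, margin_db=10.0, floor_percentile=0.1):
    """Flag frames whose level exceeds the estimated noise floor by margin_db."""
    levels = frame_levels_db(samples, frame_len)
    if not levels:
        return []
    ranked = sorted(levels)
    # Assumption: the quietest ~10% of frames represent the background noise.
    floor_db = ranked[int(floor_percentile * (len(ranked) - 1))]
    return [level > floor_db + margin_db for level in levels]
```

The percentile estimate assumes at least some of the recording is background only; if nearly every frame contains speech, the floor will be overestimated and quiet speech may be dropped.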