	by darkpicnic 1291 days ago
	Does anyone know if this new model handles silence better? I was trying to use Whisper to transcribe bursts of talking amid large spans of silence, but the frequency of hallucinations was too high.

2 comments

nomel 1290 days ago

I suspect a simple solution is to remove the silence as a preprocessing step in the pipeline.
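One way to picture that preprocessing step, as a rough sketch rather than anything Whisper ships: assume some detector has already labeled each fixed-length frame as speech or not, then merge those labels into time ranges and pass only those ranges to the transcriber. The function name `speech_segments` and the parameters `frame_ms` and `max_gap_frames` are made up for illustration.

```python
def speech_segments(flags, frame_ms=30, max_gap_frames=10):
    """Turn per-frame speech flags into (start_ms, end_ms) segments.

    `flags` is any sequence of truthy/falsy values, one per frame.
    Runs of up to `max_gap_frames` non-speech frames inside a burst are
    bridged, so a short pause does not split a sentence in two.
    """
    segments = []
    start = None        # index of the first frame of the current segment
    last_active = None  # index of the most recent speech frame
    for i, active in enumerate(flags):
        if not active:
            continue
        if start is None:
            start = i
        elif i - last_active - 1 > max_gap_frames:
            # The pause was too long: close the segment and start a new one.
            segments.append((start * frame_ms, (last_active + 1) * frame_ms))
            start = i
        last_active = i
    if start is not None:
        segments.append((start * frame_ms, (last_active + 1) * frame_ms))
    return segments
```

Keeping the start and end times, instead of just gluing the speech together, means the transcript timestamps can still be mapped back to the original recording.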


lunixbochs 1290 days ago

In large scale tests, I observed hallucinations from Whisper in speech regions of audio.


nomel 1290 days ago

Sure, but that should be considered an accuracy problem. Telling a system to do its best to extract words from background sounds, and then getting words from it, is a different type of problem.

-------

I can't reply to the below, but you have to consider the difference in the signal to noise ratio for why it should be considered a different problem.

If I told a binary image classifier to classify a clear image of a cat as either a "cat" or a "dog", and it said "dog", then that would be an accuracy problem.

If I gave the same classifier an image of a black cat standing in a very dark room, where even a human would have trouble identifying it, and it said "dog", it's not an accuracy problem so much as a signal to noise ratio problem.

It seems like you're making the assumption that all of the issues you describe have the same root cause. I don't think that's a sound assumption...tehe.


gibolt 1290 days ago

Still important for future use to not have invalid results. This is a workaround for now.


rozab 1290 days ago

You don't need ML to trim out silence
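For reference, the non-ML version is roughly this: a minimal sketch, assuming mono audio already decoded to floats in [-1.0, 1.0]. The names `frame_dbfs` and `drop_quiet_frames`, the 480-sample frame (30 ms at 16 kHz), and the -40 dBFS threshold are illustrative choices, not tuned values.

```python
import math

def frame_dbfs(frame):
    """RMS level of a frame of float samples in [-1.0, 1.0], in dBFS."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")

def drop_quiet_frames(samples, frame_len=480, threshold_db=-40.0):
    """Return the samples with every frame quieter than threshold_db removed."""
    kept = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame_dbfs(frame) >= threshold_db:
            kept.extend(frame)
    return kept
```

This only measures loudness, so anything loud enough passes through regardless of whether it is speech.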


sdenton4 1290 days ago

Silence is often problem dependent... You may want ML to differentiate between noisy audio with speech and noisy audio without speech.


darkpicnic 1290 days ago

"Silence" is a problematic term. For me, that word encompasses: squeaky chairs, typing on a loud keyboard, moving objects around on my table, etc. In a perfect world, Whisper —like a human— can easily distinguish a human voice from the din of my office, and only try and transcribe my voice.

Does anyone have solutions for clearing out "silence" from an audio file that work off something a bit more accurate than just "<= decibel x"?

Edited for grammar.
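One classic step beyond a bare level threshold, offered as a sketch and not as something anyone in this thread proposed: combine the level check with the zero-crossing rate. Voiced speech tends to have a low rate, while broadband clatter such as key clicks tends to have a high one. The names `zero_crossing_rate`, `rms`, and `looks_voiced`, and both threshold values, are assumptions for illustration.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def rms(frame):
    """Root-mean-square amplitude of a frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def looks_voiced(frame, min_rms=0.01, max_zcr=0.15):
    """Heuristic: loud enough AND few sign changes, as in voiced speech."""
    return rms(frame) >= min_rms and zero_crossing_rate(frame) <= max_zcr
```

This is still a heuristic. Unvoiced consonants like "s" have a high zero-crossing rate and would be rejected frame by frame, and a tonal noise like a squeaky chair could pass, so the per-frame flags usually need smoothing over neighboring frames.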
