Does anyone know if this new model handles silence better? I was trying to use whisper for transcribing bursts of talking amid large spans of silence, but the frequency of hallucinations was too high.
Sure, but that should be considered an accuracy problem. Telling a system to do its best to extract words from background sounds, and then getting words from it, is a different type of problem.
-------
I can't reply to the below, but you have to consider the difference in the signal to noise ratio for why it should be considered a different problem.
If I told a binary image classifier to classify a clear image of a cat as either a "cat" or a "dog", and it said "dog", then that would be an accuracy problem.
If I gave the same classifier an image of a black cat standing in a very dark room, where even a human would have trouble identifying it, and it says "dog" it's not an accuracy problem as much as a signal to noise ratio problem.
It seems like you're making the assumption that all of these have the issues you describe have the same root cause. I don't think that's a sound assumption...tehe.
"Silence" is a problematic term. For me, that word encompasses: squeaky chairs, typing on a loud keyboard, moving objects around on my table, etc. In a perfect world, Whisper —like a human— can easily distinguish a human voice from the din of my office, and only try and transcribe my voice.
Does anyone have solutions for clearing out "silence" from an audio file that works off something a bit more accurate than just "<= decibel x"?