| HN Mirror

	by isoprophlex 374 days ago
	Indeed, with another model I would get persistent transcriptions of silent parts into 'Thanks for watching!' or '[MUSIC]'. Pretty dumb that this failure mode wasn't caught in some QA process, and there are now multiple transcription models suffering from the same issue. Having silent parts in your input audio seems like it should be a very common occurrence...

2 comments

rollcat 374 days ago

When I was taught mathematics, the zero value was always considered the most important edge case. You prove something for N=0 (or N=1), then for N=M+1.

It's even more important in audio DSP: processing near-zero values can end up being extremely CPU-intensive. Look up denormal/subnormal floats.
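
A minimal sketch of where the subnormal range sits, assuming IEEE 754 doubles (which is what Python floats are on common platforms). The one-pole decay `y[n] = gain * y[n-1]` and the names `is_subnormal` and `decay_tail` are illustrative, not from any audio library. Python will not show the CPU slowdown itself; the sketch only shows how a decaying filter tail drifts into subnormals, and how a manual flush-to-zero avoids them.

```python
import sys

# Smallest positive *normal* double; anything nonzero below this is subnormal.
SMALLEST_NORMAL = sys.float_info.min


def is_subnormal(x: float) -> bool:
    """True if x is nonzero but smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def decay_tail(gain: float, steps: int, flush: bool) -> list[float]:
    """Feed one impulse into y[n] = gain * y[n-1] and return the decaying tail.

    With flush=True, any subnormal state is snapped to 0.0 (a manual flush-to-zero).
    """
    y = 1.0
    out = []
    for _ in range(steps):
        y *= gain
        if flush and is_subnormal(y):
            y = 0.0
        out.append(y)
    return out


if __name__ == "__main__":
    plain = decay_tail(0.5, 1100, flush=False)
    flushed = decay_tail(0.5, 1100, flush=True)
    print("subnormal samples without flush:", sum(map(is_subnormal, plain)))
    print("subnormal samples with flush:   ", sum(map(is_subnormal, flushed)))
```

In real DSP code the flush is usually done by a CPU mode flag or by adding a tiny offset rather than a per-sample branch; the branch here is only to make the idea visible.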

inglor_cz 374 days ago

Yeah, I studied mathematics (algebra and number theory), and zero is the point that often sports discontinuities or weird asymptotic behavior.

Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that you cannot divide by.

edwcross 374 days ago

In machine integer arithmetic, one must also beware of division by -1, which can convert MIN_INT into MIN_INT through a signed overflow and violate some arithmetic invariants, such as sign (negative divided by negative is _usually_ positive).
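
A small sketch of that edge case, simulated in Python because Python integers never overflow on their own. It assumes 32-bit two's-complement integers with wrapping semantics; the helper names `wrap_int32`, `wrapping_div_int32` and `checked_div_int32` are made up for illustration. On real hardware the outcome varies: in C this division is undefined behavior, and some CPUs trap instead of wrapping.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an unbounded Python int to a two's-complement 32-bit value."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def wrapping_div_int32(a: int, b: int) -> int:
    """Truncating division with the result wrapped to 32 bits (b must be nonzero)."""
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return wrap_int32(quotient)


def checked_div_int32(a: int, b: int) -> int:
    """Same division, but reject the two inputs that have no valid 32-bit result."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    if a == INT32_MIN and b == -1:
        raise OverflowError("INT32_MIN / -1 does not fit in 32 bits")
    return wrapping_div_int32(a, b)


if __name__ == "__main__":
    # The true quotient is 2**31, one past INT32_MAX, so it wraps back to INT32_MIN.
    print(wrapping_div_int32(INT32_MIN, -1) == INT32_MIN)
```

So a negative divided by a negative comes out negative here, which is the broken sign invariant; the checked variant treats it as an error next to division by zero.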

isoprophlex 374 days ago

Well, now in this brave new age of AI we can enjoy computer programs crashing with an

    Error: division by please upvote, share and like!

xyproto 374 days ago

This also works; I upvoted your comment.

o1bf2k25n8g5 374 days ago

I have discovered a truly marvelous proof of how to smash that like and subscribe button, which this comment box is too small to contain.

msopena 374 days ago

Signed by Pierre de FermAIt

KeplerBoy 374 days ago

Denormals are flushed to zero by default on most GPUs by the way.

rollcat 373 days ago

Makes total sense, execution time is bounded. The point is it's still a case you must consider (what if near-zero is distinct from zero and significant?)

wahnfrieden 374 days ago

whisper MUST be combined with silence detection / VAD

pferde 374 days ago

Ah, the good old "you're holding it wrong".

What good is a speech recognition tool that literally hears imaginary voices?

zettabomb 374 days ago

Considering that if you DO use VAD (voice activity detection), it's the best open-weights voice recognition model by a very wide margin, it's quite good. I'd be willing to bet that commercial products that "don't have this problem" are using VAD as well, and that this is well known to them. But Whisper is just the weights, and I suppose a simple reference implementation, not a full product.
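
To make the idea concrete, here is a deliberately crude sketch of gating audio before it reaches a transcription model. It assumes mono float samples in the range [-1, 1] and uses a plain RMS energy threshold; real VAD systems are typically trained classifiers, so this is a stand-in for the concept, not a recommendation. The names `frame_rms`, `speech_frames` and `drop_silence`, the frame length and the threshold are all assumptions for illustration.

```python
import math
from typing import Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_frames(samples: Sequence[float], frame_len: int = 480,
                  threshold: float = 0.01) -> list[bool]:
    """Flag each frame as 'active' when its RMS level reaches the threshold."""
    return [
        frame_rms(samples[start:start + frame_len]) >= threshold
        for start in range(0, len(samples), frame_len)
    ]


def drop_silence(samples: Sequence[float], frame_len: int = 480,
                 threshold: float = 0.01) -> list[float]:
    """Keep only the frames flagged as active, so silence never reaches the model."""
    kept: list[float] = []
    for i, active in enumerate(speech_frames(samples, frame_len, threshold)):
        if active:
            kept.extend(samples[i * frame_len:(i + 1) * frame_len])
    return kept


if __name__ == "__main__":
    silence = [0.0] * 960
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
    audio = silence + tone + silence
    print(speech_frames(audio))
    print(len(drop_silence(audio)), "of", len(audio), "samples kept")
```

An energy gate like this will happily pass loud non-speech noise, which is one reason the thread points to dedicated VAD models instead.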

bmacho 374 days ago

> What good is a speech recognition tool that literally hears imaginary voices?

Well, if it is supposed to work after silence detection, then it is good for speech recognition, I guess. It's like blaming a wheel for being circular because you can't sit on it. It's a part of a larger machine.

dumbfounder 374 days ago

Just lay the wheel on its side and it makes a fine seat.

nhecker 373 days ago

>imaginary voices

On the other hand, I can imagine that when things get quiet and the signal-to-noise ratio gets close to zero, random background audio (or randomness introduced in the transcription model) will be enough to tickle a critical number of neurons and elicit hallucinations.

The related thought exercise is this: Try scanning across the band with an AM or sideband radio, and after a while your brain will start to wonder "was that a voice I just heard, or music perhaps?" when in reality it was just environmental static.

wahnfrieden 374 days ago

Yes, you are holding it wrong. The good of it is that it does not output imaginary voices when used with VAD.

Show us a technology with better results that does not use VAD. If you can’t, then I’m not sure what you’re arguing against except superficialities so inconsequential that I can’t comprehend the condescension. The results speak for themselves.

Xmd5a 374 days ago

faster-whisper has a min_silence_duration_ms option
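
A hedged sketch of how that option is usually passed, based on my reading of the faster-whisper README: `min_silence_duration_ms` goes inside `vad_parameters`, and only takes effect when `vad_filter=True`. The helper names `vad_transcribe_kwargs` and `transcribe_with_vad`, the default of 500 ms and the `"small"` model size are assumptions for illustration; check the signature against the version you have installed.

```python
def vad_transcribe_kwargs(min_silence_duration_ms: int = 500) -> dict:
    """Keyword arguments that enable the built-in VAD filter in transcribe()."""
    return {
        "vad_filter": True,
        "vad_parameters": {"min_silence_duration_ms": min_silence_duration_ms},
    }


def transcribe_with_vad(audio_path: str, model_size: str = "small") -> list[str]:
    """Transcribe a file with the VAD filter on and return the segment texts."""
    # Third-party dependency, imported lazily so the helper above works without it.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path, **vad_transcribe_kwargs())
    return [segment.text for segment in segments]
```

Calling `transcribe_with_vad("talk.wav")` (a hypothetical file) needs the package installed and will download model weights on first use.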

wahnfrieden 374 days ago

There are much higher quality VAD solutions available

DANmode 372 days ago

Please name a couple to get someone started who's hacking on webapps?

I'd really appreciate it.

DANmode 372 days ago

(as would future readers, I'm sure)

xandrius 374 days ago

So if a tool has a process to have it perform at its best then it's a problem?

Do you also moan that you have to prepare a surface before applying glue or it won't stick? Or that you need to drill a guiding hole before making a larger one in wood? Or that you need to use truly prime numbers for a security key to actually be safe?

DANmode 373 days ago

What's a good starter VAD lib, and if you know, the best implementation of something like this to use in a browser-based app?

Say if I wanted to use it for Voice Nav, or Voice Input, but not piss off random people speaking the wrong language.

cmiles74 374 days ago

If that's truly the case then they should make it part of the product, IMHO.

wahnfrieden 373 days ago

How is it not the case? It is unusable without VAD or editing. I don't understand what you're questioning.

I agree their products could be better "end to end" integrated. Meanwhile there is a continuously-improving field of work for detecting speech (which Whisper is incapable of). They offer official "cookbooks" with guidance on an approach they recommend: https://cookbook.openai.com/examples/whisper_processing_guid...

> At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use Pydub to detect and trim the silence.

(Official OpenAI quote)
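
The cookbook does this with Pydub; the sketch below is not Pydub's API, just a pure-Python stand-in for the same idea so it runs without dependencies. It assumes mono float samples in [-1, 1] and scans fixed-size chunks until one reaches a peak threshold. The names `leading_silence_end` and `trim_leading_silence`, the chunk length and the threshold are illustrative assumptions.

```python
from typing import Sequence


def leading_silence_end(samples: Sequence[float], chunk_len: int = 160,
                        threshold: float = 0.01) -> int:
    """Return the start index of the first chunk whose peak reaches the threshold.

    Returns len(samples) when the whole signal stays below the threshold.
    """
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        if max(abs(s) for s in chunk) >= threshold:
            return start
    return len(samples)


def trim_leading_silence(samples: Sequence[float], chunk_len: int = 160,
                         threshold: float = 0.01) -> list[float]:
    """Drop everything before the first non-silent chunk."""
    return list(samples[leading_silence_end(samples, chunk_len, threshold):])


if __name__ == "__main__":
    audio = [0.0] * 400 + [0.2] * 100
    trimmed = trim_leading_silence(audio)
    print(len(audio), "->", len(trimmed), "samples after trimming")
```

Trimming at chunk boundaries keeps a little of the silence just before the first sound, which is usually harmless and avoids clipping the onset.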

What's VAD?

Voice Activity Detection (it predicts whether a short clip contains speech, eg to mute your microphone when you aren't speaking).
