by qwertox 1174 days ago

Whisper cuts the audio into 30-second chunks. So if you have a one-minute recording where the first half contains conversation and the second half contains nothing, it will assume it has to find something in that second 30-second block, without any reference to what the speech in the first chunk actually sounded like.

Try pre-processing the audio with voice activity detection: detect only where someone is speaking, not what is being said, and cut the audio into snippets that contain only speech, so that Whisper doesn't have to guess whether a segment contains speech or not.
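As a rough illustration of that pre-processing step, here is a minimal sketch that finds speech-like spans by frame energy. It is a crude stand-in for a real voice activity detector, not what Whisper or any particular VAD does. The function name `find_speech_regions` and the `frame_ms`, `threshold` and `min_gap_ms` values are all made up for this example, and the samples are assumed to be mono floats in the range -1.0 to 1.0.

```python
import math


def find_speech_regions(samples, sample_rate, frame_ms=30,
                        threshold=0.02, min_gap_ms=300):
    """Return (start_sec, end_sec) spans whose frame RMS energy exceeds threshold.

    Hypothetical helper: energy only says "something is loud here",
    so background noise will pass as speech. A trained VAD is more robust.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_gap = min_gap_ms // frame_ms  # silent frames tolerated inside one span
    n_frames = math.ceil(len(samples) / frame_len)

    regions = []
    start = None   # index of the first loud frame of the open span
    silent = 0     # consecutive silent frames seen inside the open span

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent > max_gap:
                # Close the span at the end of its last loud frame.
                end = i - silent + 1
                regions.append((start * frame_len / sample_rate,
                                end * frame_len / sample_rate))
                start = None
                silent = 0

    if start is not None:
        # Audio ended while a span was still open; drop trailing silence.
        end = n_frames - silent
        end_sample = min(end * frame_len, len(samples))
        regions.append((start * frame_len / sample_rate,
                        end_sample / sample_rate))
    return regions
```

Each returned span can then be cut out of the original audio and passed to Whisper on its own, so no chunk consists purely of silence.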

Also, if you cut it up into chunks, let it transcribe each chunk, and ask for JSON output instead of the other output formats, you'll get a bunch of extra per-segment parameters that help you identify problematic sections. For example, segments hallucinated over silence usually have a higher "no_speech_prob", hallucinated repetitions show up as an unusually high "compression_ratio" (Whisper's own decoding treats anything above 2.4 as a failure by default), and segments with a very low "avg_logprob" tend not to be that accurate either.
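A minimal sketch of that kind of filtering follows. It assumes the JSON result has a top-level "segments" list whose entries carry "start", "end", "no_speech_prob", "compression_ratio" and "avg_logprob". The function name `flag_suspect_segments` is made up, and the cutoffs are assumptions borrowed from Whisper's default decoding thresholds (0.6, 2.4 and -1.0), so tune them for your own audio. Note the direction of the compression check: this sketch treats a high ratio as suspect, since repeated text compresses well.

```python
def flag_suspect_segments(result, no_speech_max=0.6,
                          compression_max=2.4, logprob_min=-1.0):
    """Return (start, end, reasons) for segments that look unreliable.

    Hypothetical helper; the thresholds are assumptions, not hard rules.
    """
    flagged = []
    for seg in result.get("segments", []):
        reasons = []
        # High probability that the window contained no speech at all.
        if seg.get("no_speech_prob", 0.0) > no_speech_max:
            reasons.append("no_speech_prob")
        # Text that compresses very well is usually repetitive.
        if seg.get("compression_ratio", 0.0) > compression_max:
            reasons.append("compression_ratio")
        # Low average token log-probability means low model confidence.
        if seg.get("avg_logprob", 0.0) < logprob_min:
            reasons.append("avg_logprob")
        if reasons:
            flagged.append((seg.get("start"), seg.get("end"), reasons))
    return flagged


# Invented example data, only to show the shape of the input.
example = {
    "segments": [
        {"start": 0.0, "end": 4.2, "no_speech_prob": 0.02,
         "compression_ratio": 1.4, "avg_logprob": -0.25},
        {"start": 30.0, "end": 58.0, "no_speech_prob": 0.91,
         "compression_ratio": 3.1, "avg_logprob": -0.40},
    ]
}
suspects = flag_suspect_segments(example)
```

Flagged segments are candidates to drop or re-transcribe, not proof of a hallucination, so it is worth spot-checking a few by ear.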