
I was working on this yesterday. The most common approach with Whisper seems to be simply breaking the audio into chunks and transcribing each one separately. This works but, as you'd expect, it sometimes has trouble at the chunk edges. The segments also have to be sufficiently long (around 10s) or accuracy suffers, which means it's not truly real-time.
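For concreteness, here's a rough sketch of the chunking step only. `chunk_bounds` is a made-up helper name, not part of Whisper, and the 10s default is just the ballpark figure from above; you would slice the audio at these boundaries and hand each slice to the model.

```python
def chunk_bounds(total_s, chunk_s=10.0):
    """Return (start, end) times in seconds for back-to-back chunks.

    Hypothetical helper: it only computes boundaries and does not
    touch audio or call any transcription API.
    """
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)  # last chunk may be shorter
        bounds.append((start, end))
        start = end
    return bounds


# 25 seconds of audio split into 10-second chunks
print(chunk_bounds(25.0))
```

Any word that straddles one of these boundaries gets cut in two, which is where the edge trouble comes from.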

You could do better by overlapping the segments, but then stitching the transcriptions together becomes an issue, since Whisper doesn't provide reliable per-token timestamps [0] and the output for the shared part of overlapping segments isn't necessarily the same. I can imagine a cool approach where you transcribe long, overlapping chunks in real time and intelligently merge the stream of words somehow, though.
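One naive way the merging might look, purely as a sketch: since the timestamps can't be trusted, align on the words themselves. This finds the longest run of words shared between the tail of the previous transcript and the head of the new one and joins there. `merge_words`, `max_overlap`, and `min_match` are names I made up for illustration, and the approach assumes the two transcripts agree on at least a couple of consecutive words in the overlap, which won't always hold.

```python
from difflib import SequenceMatcher


def merge_words(prev, new, max_overlap=20, min_match=2):
    """Stitch two word lists whose tail and head cover the same audio.

    Assumption: the overlap shows up as a run of at least `min_match`
    identical words. If no such run exists, fall back to concatenation.
    """
    tail = prev[-max_overlap:]
    head = new[:max_overlap]
    matcher = SequenceMatcher(None, tail, head, autojunk=False)
    m = matcher.find_longest_match(0, len(tail), 0, len(head))
    if m.size < min_match:
        return prev + new
    # Keep prev up to the end of the shared run, then continue with new.
    # Words after the run in prev are dropped, since they sit at the
    # chunk edge and are the ones most likely to be wrong.
    cut_prev = len(prev) - len(tail) + m.a + m.size
    return prev[:cut_prev] + new[m.b + m.size:]


prev = "the quick brown fox jumps".split()
new = "brown fox jumped over the lazy dog".split()
print(" ".join(merge_words(prev, new)))
```

Here the two chunks disagree on the edge word ("jumps" vs "jumped"), and the merge keeps the version from the chunk where that word is not at the boundary. Repeated phrases in the overlap can still fool it.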

Some more useful discussion here (whisper.cpp project, but still relevant) [1].

0. https://github.com/openai/whisper/discussions/332

1. https://github.com/ggerganov/whisper.cpp/issues/10