

	by phkahler 307 days ago
	I thought whisper and others took large chunks (20-30 seconds) of speech, or a complete wave file as input. How do you get real-time transcription? What size chunks do you feed it? To me, STT should take a continuous audio stream and output a continuous text stream.


yujonglee 307 days ago

I use VAD to chunk audio.

Whisper and Moonshine both work on chunks, but for Moonshine:

> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.

Also, for Kyutai, we can feed continuous audio in and get continuous text out.

- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...
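A minimal sketch of what "use VAD to chunk audio" can look like, assuming 16 kHz mono samples normalized to [-1, 1] and a plain energy threshold. This is not owhisper's actual VAD; `vad_chunks`, the threshold, and the frame counts are all illustrative assumptions. Each yielded chunk is one utterance that could be handed to a chunk-based model such as Whisper or Moonshine.

```python
import math
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000                          # assumed input rate (Hz)
FRAME_MS = 30                                 # analysis frame length
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 480 samples per frame


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def vad_chunks(
    samples: Sequence[float],
    threshold: float = 0.02,        # assumed energy threshold for "speech"
    max_silence_frames: int = 10,   # ~300 ms of silence closes a chunk
    min_speech_frames: int = 3,     # drop blips shorter than ~90 ms
) -> Iterator[List[float]]:
    """Yield one list of samples per detected utterance."""
    chunk: List[float] = []
    speech_frames = 0
    silence_run = 0
    for start in range(0, len(samples), FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        if frame_rms(frame) >= threshold:
            chunk.extend(frame)
            speech_frames += 1
            silence_run = 0
        elif chunk:
            # Keep short pauses inside the utterance.
            chunk.extend(frame)
            silence_run += 1
            if silence_run >= max_silence_frames:
                if speech_frames >= min_speech_frames:
                    yield chunk
                chunk, speech_frames, silence_run = [], 0, 0
    # Flush whatever is left when the audio ends mid-utterance.
    if chunk and speech_frames >= min_speech_frames:
        yield chunk
```

Because chunks are cut at pauses rather than at fixed offsets, a sentence is less likely to be split in the middle, and shorter chunks benefit from Moonshine's length-dependent compute. A production setup would more likely use a trained VAD model than a raw energy threshold.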


zveyaeyv3sfye 306 days ago

Having used whisper and noticed the useless quality due to their 30-second chunks, I would stay far away from software working on even a shorter duration.

The short duration effectively means that the transcription will start producing nonsense as soon as a sentence is cut up in the middle.


mijoharas 307 days ago

Something like that, in a cli tool, that just gives text to stdout would be perfect for a lot of use cases for me!

(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)


ctbellmar 306 days ago

I wrote a tool that may be just the thing for you:

https://github.com/bikemazzell/skald-go/

Just speech to text, CLI only, and it can paste into whatever app you have open.


mijoharas 306 days ago

Oh, this does sound cool. Couple of questions that aren't clear from the readme (to me).

What exactly does the silence detection mean? Does that mean it'll wait until a pause, and then send the audio off to whisper, and return the output (and stop the process)? Same question with continuous. Does that just mean it continues going until CTRL+C?

Nvm, answered my own question, looks like yes for both[0][1]. Cool, this seems pretty great actually.

[0] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...

[1] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...


yujonglee 307 days ago

Are you thinking about the realtime use case or the batch use case?

For just transcribing a file or audio stream,

`owhisper run <MODEL> --file a.wav` or

`curl https://something.com/audio.wav | owhisper run <MODEL>`

might make sense.


mijoharas 307 days ago

Agreed, both of those make sense, but I was thinking realtime. (Pipes can stream data, and I'd find it useful to have something that can stream STT output to stdout in realtime.)
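A rough sketch of the pipe-friendly shape described here, assuming raw 16-bit, 16 kHz mono PCM arriving on a binary stream. `stream_transcribe` and the `transcribe` callback are hypothetical names, not part of owhisper; the callback stands in for whatever model call actually produces text.

```python
import io
import sys
from typing import BinaryIO, Callable, TextIO

SAMPLE_RATE = 16_000       # assumed: 16 kHz mono
BYTES_PER_SAMPLE = 2       # assumed: signed 16-bit PCM
BLOCK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE // 2   # 0.5 s of audio


def stream_transcribe(
    audio_in: BinaryIO,
    text_out: TextIO,
    transcribe: Callable[[bytes], str],   # hypothetical model hook
    block_bytes: int = BLOCK_BYTES,
) -> int:
    """Read audio block by block, writing text as soon as it is available.

    Returns the number of blocks consumed.
    """
    blocks = 0
    while True:
        block = audio_in.read(block_bytes)
        if not block:          # EOF: the upstream process closed the pipe
            break
        text = transcribe(block)
        if text:
            text_out.write(text + "\n")
            text_out.flush()   # flush per block so downstream sees it now
        blocks += 1
    return blocks


def fake_transcribe(block: bytes) -> str:
    """Placeholder that only reports block size; a real model goes here."""
    return f"[{len(block)} bytes]"


if __name__ == "__main__":
    # Demo on in-memory audio; swap in sys.stdin.buffer to read from a pipe.
    stream_transcribe(io.BytesIO(b"\x00" * 40_000), sys.stdout, fake_transcribe)
```

Flushing after every block is what keeps the output realtime when stdout is itself a pipe, since Python block-buffers stdout in that case.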


yujonglee 307 days ago

It's open source. Happy to review & merge if you can send us a PR!

https://github.com/fastrepl/hyprnote/blob/8bc7a5eeae0fe58625...
