
by refulgentis 706 days ago

I did some core work on TTS at Google, at several layers, and I've never quite understood what people mean by streaming vs. not.

In each and every case I'm familiar with, streaming means "send the whole audio thus far to the inference engine, inference it, and send back the transcription"

I have a Flutter library that does the same flow as this (though via ONNX, so I can cover all platforms), and Whisper + Silero is ~identical to the interfaces I used at Google.

If the idea is that streaming is when each audio byte is only sent once to the server, there's still an audio buffer accumulated -- it's just on the server.
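A minimal sketch of the flow described above, where "streaming" means re-running inference on all the audio received so far. The names `retranscribe_stream` and `fake_transcribe` are hypothetical, and the stand-in model is not a real recognizer; it only makes the loop runnable.

```python
from typing import Callable, Iterable, Iterator


def retranscribe_stream(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
) -> Iterator[str]:
    """Yield a partial transcript after every chunk by re-running
    inference on everything received so far."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)             # audio accumulates; nothing is discarded
        yield transcribe(bytes(buffer))  # full re-inference on each update


def fake_transcribe(audio: bytes) -> str:
    # Stand-in for a real model (assumption): one "word" per 4 bytes of audio.
    return " ".join(f"w{i}" for i in range(len(audio) // 4))


partials = list(
    retranscribe_stream([b"\x00" * 4, b"\x00" * 4, b"\x00" * 8], fake_transcribe)
)
```

Whether the buffer lives on the client or the server, the model sees the whole buffer on every call, which is the point being made here.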


opprobium 706 days ago

Streaming for TTS doesn't matter but for speech to text it is more meaningful in interactive cases. In that case the user's speech is arriving in real time and streaming can mean a couple levels of things:

- Overlap compute with the user speaking: Not having to wait until all the speech has been acquired can massively reduce latency at the end of speech and allow a larger model to be used. This doesn't have to be the whole system, for instance an encoder can run in this fashion along audio as it comes in even if the final step of the system then runs in a non-streaming fashion.

- Produce partial results while the user is speaking: This can be just a UI nice to have, but it can also be much deeper, eg, a system can be activating on words or phrases in the input before the user is finished speaking which can dramatically change latency.

- Better segmentation: Whisper + Silero is just using VAD to make segments for Whisper, which is not at all the best you can do if you are actually decoding while you go. Looking at the results as you go allows you to make much better and faster segmentation decisions.

refulgentis 706 days ago

The only models that do what you're poking at holistically are 4o (claimed) and that French company with the 7B one. They're also bleeding edge, either unreleased or released and way wilder, e.g. the French one interrupts too much, and screams back in an alien language occasionally.

Until these, you'd use echo cancellation to try and allow interruptible dialogue, and that's unsolved: you need a consistently cooperative chipset vendor for that (read: wasn't possible even at scale, with carrots, presumably sticks, and much cajoling. So it works on iPhones consistently.)

On every stack I've seen that is described as streaming, the partial results are obtained by running inference on the entire audio so far, and silence is determined by VAD.

I find it hard to believe that Google and Apple specifically, and every other audio stack I've seen, are choosing to do "not the best they can at all"

opprobium 706 days ago

This is exactly what Google ASR does. Give it a try and watch how the results flow back to you, it certainly is not waiting for VAD segment breaking. I should know.

Streaming used to be something people cared about more. VAD is always part of those systems as well, you want to use it to start segments and to hard cut-off, but it is just the starting off point. It's kind of a big gap (to me) that's missing in available models since Whisper came out, partly I think because it does add to the complexity of using the model, and latency has to be tuned/traded-off with quality.

r2_pilot 706 days ago

Thank you for your insight. It confirms some of my suspicions working in this area (you wouldn't happen to know anybody who makes anything more modern than the Respeaker 4-mic array?). My biggest problem is even with AEC, the voice output is triggering the VAD and so it continually thinks it's getting interrupted by a human. My next attempt will be to try to only signal true VAD if there's also sound coming from anywhere but behind, where the speaker is. It's been an interesting challenge so far though.
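A rough sketch of the direction gate described above: only accept a VAD trigger when the estimated direction of arrival (DOA) is outside a cone around the loudspeaker. The function name, the 180 degree speaker position and the 30 degree exclusion cone are all assumptions for illustration; a real array would supply its own DOA estimate.

```python
def vad_from_listener(
    vad_active: bool,
    doa_deg: float,
    speaker_deg: float = 180.0,   # assumed loudspeaker bearing ("behind")
    exclusion_deg: float = 30.0,  # assumed half-width of the ignored cone
) -> bool:
    """Return True only if the VAD fired and the sound did not come
    from the loudspeaker's direction."""
    # Smallest absolute angle between the two bearings, in [0, 180].
    diff = abs((doa_deg - speaker_deg + 180.0) % 360.0 - 180.0)
    return vad_active and diff > exclusion_deg
```

This does nothing for a human standing near the loudspeaker, or for reflections that arrive from the front, so it is a filter on top of AEC rather than a replacement for it.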

refulgentis 706 days ago

Re: mic, alas, no, BigCo kinda sucked, I had to go way out of my way to get to work on interesting stuff, it never mattered, and even when you did, you never got over the immediate wall of your own org, except for brief moments. i.e. never ever had anyone even close to knowing anything about the microphones we'd be using, they were shocked to hear what AEC was, even when what we were working on was a marketing tentpole for Pixel. Funny place.

I'm really glad you saw this. So, so, so much time and hope was wasted there on the Nth team of XX people saying "how hard can it be? given physics and a lil ML, we can do $X", and inevitably reality was far more complicated, and it's important to me to talk about it so other people get a sense it's not them, it's the problem. Even unlimited resources and your Nth fresh try can fail.

FWIW my mind's been grinding on how I'd get my little Silero x Whisper gAssistant on device replica pulling off something akin to the gpt4o demo. I keep coming back to speaker ID: replace Silero with some newer models I'm seeing hit ONNX. Super handwave-y, but I can't help thinking this does an end-around both AEC being shit on presumably most non-Apple devices, and poor interactions from trying to juggle two things operating differently (VAD and AEC). """Just""" detect when there's >= 2 simultaneous speakers with > 20% confidence --- of course, tons of bits missing from there, ideally you'd be resilient to e.g. a TV in the background. Sigh. Tough problems.

azeirah 706 days ago

I'm not particularly experienced, but I did have good experiences with picovoice's services. It's a business specialised in programmatically available audio, tts, vad services etc.

They have a VAD that is trained on a 10 second clip of -your- voice, and it is then only activated by -your- voice. It works quite well in my experience, although it does add a little bit of additional latency before it starts detecting your voice. That is reasonably easy to overcome by keeping a 1s buffer of audio ready at all times: when the VAD becomes active, just add the past 100-200ms of the buffer to the recorded audio. Works perfectly fine. It's just that the UI showing "voice detected" or "voice not detected" might lag behind by 100-200ms.

Source: I worked on a VAD + whisper + LLM demo project this year and ran into some VAD issues myself too.
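The pre-roll trick described above can be sketched roughly like this. `PreRollRecorder` is a hypothetical name, the 20 ms frame size is an assumption, and for brevity the rolling history only holds the pre-roll amount rather than a full 1s.

```python
from collections import deque
from typing import Deque, List


class PreRollRecorder:
    """Keeps a short rolling history of frames so that speech onset
    missed by a slow VAD can be prepended to the recording."""

    def __init__(self, frame_ms: int = 20, pre_roll_ms: int = 200) -> None:
        # Oldest frames fall off automatically once maxlen is reached.
        self.history: Deque[bytes] = deque(maxlen=pre_roll_ms // frame_ms)
        self.recording: List[bytes] = []
        self.active = False

    def push(self, frame: bytes, vad_active: bool) -> None:
        if vad_active and not self.active:
            # VAD just fired: back-fill with the frames we held on to.
            self.recording.extend(self.history)
            self.history.clear()
        self.active = vad_active
        if vad_active:
            self.recording.append(frame)
        else:
            self.history.append(frame)
```

Usage is one `push(frame, vad_active)` call per incoming audio frame; `recording` then starts slightly before the moment the VAD reported speech.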

Nimitz14 706 days ago

This is a complete non sequitur lol. FYI whisper is not a streaming model though it can, with some work, be adapted to be one.

refulgentis 706 days ago

You and I agree fully, then. IMHO it's not too much work, at all, 400 LOC and someone else's models. Of course, as in that old saw, the art is knowing exactly those models, knowing what ONNX is, etc. etc., that's what makes it fast.

The non sequitur is because I can't feel out what's going on from their perspective, the hedging left a huge range where they could have been saying "I saw the gpt4o demo and there's another way that lets you have more natural conversation" and "hey think like an LSTM model, like Silero, there are voice recognizers that let you magically get a state and current transcription out", or in between, "yeah in reality the models are f(audio bytes) => transcription", which appears to be closer to your position, given your "it's not a streaming model, though it can be adapted"

iamjackg 706 days ago

I think in practical terms (at least for me):

- streaming == I talk and the text appears as I talk

- batched == I talk, and after I'm done talking some processing happens and the text gets populated

refulgentis 706 days ago

Gotcha, then, it's "not even wrong" in the Pauli sense to say Whisper isn't streaming

opprobium 706 days ago

It is not streaming in the way people normally use this term. It's a fuzzy notion but typically streaming means something encompassing:

- Processing and emitting results on something closer to word by word level

- Allowing partial results while the user is still speaking and mid-segment

- Not relying on an external segmenter to determine the chunking (and therefore also latency) of the output.

refulgentis 706 days ago

This is fascinating because if your hint in another comment indicates you worked on this at Google, it's entirely possible I have this all wrong because I'm missing the actual ML part - I wrote the client encoder & server decoder for Opus and the client-side UI for the SODA launch, and I'm honestly really surprised to hear Google has different stuff. The client-side code loop AGSA used is 100% replicated, in my experience, by using Whisper.

I don't want to make too strong of claims given NDAs (in reality, my failing memory :P) but I'm 99% sure inference on-device is just as fast as SODA. I don't know what to say because I'm flummoxed, it makes sense to me that Whisper isn't as good as SODA, and I don't want to start banging the table that it's no different from a user or client perspective, I don't think that's fair. There's a difference in model architecture and it matters. I think it's at least a couple of WER points behind.

But then where are the better STT solutions? Are all the obviously much better solutions really all locked up? Picovoice is the only closed solution I know of available for local dev, and per even them, it's only better than the worst Whisper. Smallest is 70 MB in ONNX vs. 130 MB for next step up, both inference fine with ~600 ms latency from audio byte to mic to text on screen, ranging from WASM in web browser to 3 year old Android phone.

regularfry 705 days ago

Something to keep an eye on is that Whisper is strongly bound to processing a 30-second window at a time. So if you send it 30 seconds of audio, and it decodes it, then you send it another one second of audio, the only sensible way it can work is to have it reprocess seconds 2s-30s in addition to the new data at 31s. If there was a way to have it just process the update, then there's every possibility it could avoid a lot of work.

I suspect that's what people are getting at by saying it's "not streaming": it's built as a batch process but, under some circumstances, you can run it fast enough to get away with pretending that it isn't.
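To put a rough number on the reprocessing, here is a small sketch that counts how many seconds of audio pass through the model when it is re-run on the most recent window after every update. It assumes the cost of a call is proportional to the audio in its window (capped at 30 seconds), which is a simplification; `audio_processed` is a hypothetical helper.

```python
WINDOW_S = 30  # fixed input window, per the comment above


def audio_processed(total_s: int, step_s: int) -> int:
    """Seconds of audio pushed through the model when re-running on the
    most recent window after every `step_s` seconds of new audio."""
    processed = 0
    for end in range(step_s, total_s + 1, step_s):
        processed += min(end, WINDOW_S)  # the window never exceeds 30 s
    return processed


# One minute of speech with a 1-second update rate versus two batch calls.
per_second_updates = audio_processed(60, 1)
two_batch_calls = audio_processed(60, 30)
```

Under that assumption, one-second updates over a minute of audio cost 1365 seconds of model input against 60 for plain batching, which is the work an incremental model could avoid.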

opprobium 705 days ago

You are missing the speech decoding part. I can't speak to why the clients you were working on were doing what they were doing. For a different reference point see the cloud streaming api.

This is a good public reference: https://research.google/blog/an-all-neural-on-device-speech-...

Possible confusions from that doc: "RNN-T" is entirely orthogonal to RNNs (and not the only streamable model). Attention is also orthogonal to streaming. A chunked or sliding window attention can stream, a bi-directional RNN cannot. How you think of an encoder and a decoder streaming is also different.
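One way to see why chunked attention can stream while full bidirectional context cannot is to look at which future frames each output depends on. This is an illustrative single-layer sketch with hypothetical helper names, not any particular model's masking scheme.

```python
from typing import List


def chunked_mask(n_frames: int, chunk: int) -> List[List[bool]]:
    """mask[i][j] is True if frame i may attend to frame j.
    Frame i sees every frame up to the end of its own chunk."""
    return [
        [j < ((i // chunk) + 1) * chunk for j in range(n_frames)]
        for i in range(n_frames)
    ]


def full_mask(n_frames: int) -> List[List[bool]]:
    """Bidirectional context: every frame sees every other frame."""
    return [[True] * n_frames for _ in range(n_frames)]


def frames_needed(mask: List[List[bool]], i: int) -> int:
    """How many input frames must have arrived before frame i can be computed."""
    return max(j for j, allowed in enumerate(mask[i]) if allowed) + 1
```

With 8 frames and a chunk size of 2, frame 0 can be computed once 2 frames have arrived; with the full mask it has to wait for all 8, i.e. for the end of the utterance.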

At a practical level, if a model is fast enough, and VAD is doing an adequate job, you can get something that looks like "streaming" with a non-streaming model. If a streaming model has tons of look-ahead or a very large input chunk size, its latency may not feel a lot better.

Where the difference is sharp is where VAD is not adequate: Users speak in continuous streams of audio, they leave unusual gaps within sentences and run sentences together. A non-streaming system either hurts quality because sentences (or even words) get broken up that shouldn't, or has to wait forever and doesn't get a chance to run, when a streaming system would have already been producing output.
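A toy illustration of that failure mode, assuming segmentation is driven purely by pause length. The word timings below are invented for the example and stand in for the speech regions a VAD would report; `segment_by_silence` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def segment_by_silence(
    words: List[Tuple[str, float, float]],  # (word, start_s, end_s)
    max_gap_s: float,
) -> List[str]:
    """Cut a new segment whenever the pause between two words exceeds max_gap_s."""
    segments: List[List[str]] = []
    prev_end: Optional[float] = None
    for word, start, end in words:
        if prev_end is None or start - prev_end > max_gap_s:
            segments.append([])
        segments[-1].append(word)
        prev_end = end
    return [" ".join(segment) for segment in segments]


# Made-up timings: a long hesitation mid-sentence, then two sentences run together.
utterance = [
    ("turn", 0.0, 0.3), ("on", 0.35, 0.5), ("the", 0.55, 0.65),
    ("kitchen", 1.6, 2.0), ("lights", 2.05, 2.4),
    ("what", 2.6, 2.8), ("time", 2.85, 3.1), ("is", 3.15, 3.25), ("it", 3.3, 3.4),
]

loose = segment_by_silence(utterance, 0.5)
tight = segment_by_silence(utterance, 0.1)
```

No single pause threshold recovers the two intended sentences here: the loose setting splits the first sentence and glues the rest together, and the tight setting breaks the first sentence in two. Splitting correctly needs information from the decoded words, not just the silence.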

And to your points about echo cancellation and interference: There are many text-only operations that benefit from being able to start early in the audio stream, not late.

I just went through the process of helping someone stand up an interactive system with whisper etc and the lack of an open sourced whisper-quality streaming system is such a bummer because it really is so much laggier than it has to be.

flax 706 days ago

"streaming" in this case is like another reply said: transcriptions appear as I talk. Compared to not-streaming in which the service waits for silence, then processes the captured speech, then returns some transcription.

Is your Flutter library available? And does it run locally? I'm looking for a good Flutter streaming (in the sense above) speech recognition library. vosk looks good, but it's lacking some configurability such as selecting audio source.

refulgentis 706 days ago

FONNX, haven't gone out of my way to make it trivial[1], but, it's very good, battle tested on every single platform. (And yes runs locally)

[1] example app shows how to do everything, there's basic doc, but man the amount of nonsense you need to know to pull it all together is just too hard to document without a specific Q. Do feel free to file an issue