| HN Mirror

by refulgentis 698 days ago

This is fascinating because if your hint in another comment indicates you worked on this at Google, it's entirely possible I have this all wrong because I'm missing the actual ML part - I wrote the client encoder & server decoder for Opus and the client-side UI for the SODA launch, and I'm honestly really surprised to hear Google has different stuff. The client-side code loop AGSA used is 100% replicated, in my experience, by using Whisper.

I don't want to make overly strong claims given NDAs (in reality, my failing memory :P) but I'm 99% sure on-device inference is just as fast as SODA. I don't know what to say because I'm flummoxed. It makes sense to me that Whisper isn't as good as SODA, and I don't want to start banging the table that it's no different from a user or client perspective; I don't think that's fair. There's a difference in model architecture and it matters. I think it's at least a couple of WER points behind.

But then where are the better STT solutions? Are all the obviously much better solutions really locked up? Picovoice is the only closed solution I know of that's available for local dev, and even by their own account it's only better than the worst Whisper. The smallest is 70 MB in ONNX vs. 130 MB for the next step up; both run inference fine with ~600 ms latency from audio byte at the mic to text on screen, on everything from WASM in a web browser to a 3-year-old Android phone.

2 comments

regularfry 698 days ago

Something to keep an eye on is that Whisper is strongly bound to processing a 30-second window at a time. So if you send it 30 seconds of audio, and it decodes it, then you send it another one second of audio, the only sensible way it can work is to have it reprocess seconds 2-30 in addition to the new data at second 31. If there were a way to have it process just the update, there's every possibility it could avoid a lot of work.

I suspect that's what people are getting at by saying it's "not streaming": it's built as a batch process but, under some circumstances, you can run it fast enough to get away with pretending that it isn't.
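To put rough numbers on the reprocessing cost, here's a back-of-envelope sketch. It doesn't call Whisper at all; it only counts how many seconds of audio get pushed through a model when the most recent 30-second window is re-decoded after every hop. The function names and the one-second hop are my own assumptions for illustration, and it ignores any padding of short inputs.

```python
WINDOW_S = 30  # fixed window length in seconds, per the comment above


def window_bounds(now_s, window_s=WINDOW_S):
    """Return (start, end) in seconds of the window decoded at time now_s."""
    return (max(0, now_s - window_s), now_s)


def reprocessed_seconds(now_s, hop_s=1, window_s=WINDOW_S):
    """Seconds in the current window that the previous window already covered."""
    prev_start, prev_end = window_bounds(now_s - hop_s, window_s)
    cur_start, _ = window_bounds(now_s, window_s)
    return max(0, prev_end - cur_start)


def total_decoded_seconds(total_s, hop_s=1, window_s=WINDOW_S):
    """Total audio seconds decoded if every hop re-decodes the latest window."""
    total = 0
    for now_s in range(hop_s, total_s + 1, hop_s):
        start, end = window_bounds(now_s, window_s)
        total += end - start
    return total


if __name__ == "__main__":
    print(window_bounds(31))        # window decoded when second 31 arrives
    print(reprocessed_seconds(31))  # seconds of that window already seen
    print(total_decoded_seconds(60), "seconds decoded for 60 seconds of audio")
```

A model that could process only the update would decode each second once, so the gap between `total_decoded_seconds(n)` and `n` is the work that could in principle be avoided.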

opprobium 698 days ago

You are missing the speech decoding part. I can't speak to why the clients you were working on were doing what they were doing. For a different reference point see the cloud streaming api.

This is a good public reference: https://research.google/blog/an-all-neural-on-device-speech-...

Possible confusions from that doc: "RNN-T" is entirely orthogonal to RNNs (and is not the only streamable model). Attention is also orthogonal to streaming. A chunked or sliding-window attention can stream; a bi-directional RNN cannot. How you think of an encoder streaming versus a decoder streaming is also different.
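One way to see the "attention is orthogonal to streaming" point is to look at how far into the future each frame is allowed to attend. This is a toy sketch in plain Python, not taken from the linked post; `chunked_mask`, `full_mask`, and `max_lookahead` are hypothetical helpers, and the chunk sizes are arbitrary.

```python
def chunked_mask(n_frames, chunk, left_chunks):
    """mask[i][j] is True if frame i may attend to frame j.

    Frame i sees its own chunk plus up to left_chunks earlier chunks,
    and nothing from later chunks.
    """
    mask = []
    for i in range(n_frames):
        ci = i // chunk
        row = []
        for j in range(n_frames):
            cj = j // chunk
            row.append(ci - left_chunks <= cj <= ci)
        mask.append(row)
    return mask


def full_mask(n_frames):
    """Unrestricted attention: every frame sees the whole utterance."""
    return [[True] * n_frames for _ in range(n_frames)]


def max_lookahead(mask):
    """Largest number of future frames any frame needs before it can be computed."""
    worst = 0
    for i, row in enumerate(mask):
        last_visible = max(j for j, ok in enumerate(row) if ok)
        worst = max(worst, last_visible - i)
    return worst


if __name__ == "__main__":
    print(max_lookahead(chunked_mask(8, chunk=2, left_chunks=1)))
    print(max_lookahead(full_mask(8)))
```

With the chunked mask the look-ahead is bounded by the chunk size no matter how long the audio is, so output can start after the first chunk. With the full mask the look-ahead grows with the utterance, which is the same reason a bi-directional RNN can't stream.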

At a practical level, if a model is fast enough, and VAD is doing an adequate job, you can get something that looks like "streaming" with a non-streaming model. If a streaming model has tons of look-ahead or a very large input chunk size, its latency may not feel a lot better.

Where the difference is sharp is where VAD is not adequate: users speak in continuous streams of audio, they leave unusual gaps within sentences, and they run sentences together. A non-streaming system either hurts quality because sentences (or even words) get broken up that shouldn't be, or has to wait forever and doesn't get a chance to run, when a streaming system would have already been producing output.
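Here's a toy illustration of that trade-off, assuming a per-frame speech/non-speech decision is already available. `endpoint` and `min_silence` are names I made up for the sketch; a real VAD is far more involved. A segment is only handed to the recognizer once `min_silence` consecutive non-speech frames have been seen.

```python
def endpoint(frames, min_silence):
    """Split a sequence of per-frame speech flags into segments.

    Returns a list of (start, end, emitted_at) tuples, where start/end are
    frame indices (end exclusive) and emitted_at is the frame at which the
    segment could first be sent to a non-streaming recognizer.
    """
    segments = []
    start = None
    silence = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence:
                segments.append((start, i - silence + 1, i))
                start = None
                silence = 0
    if start is not None:
        # Input ended before enough silence was seen; flush what we have.
        segments.append((start, len(frames) - silence, len(frames) - 1))
    return segments


if __name__ == "__main__":
    # One sentence with a 3-frame pause in the middle, then 6 frames of silence.
    frames = [1] * 5 + [0] * 3 + [1] * 4 + [0] * 6
    print(endpoint(frames, min_silence=2))  # short threshold
    print(endpoint(frames, min_silence=5))  # long threshold
```

With the short threshold the sentence is cut in two at the pause; with the long one it stays whole, but nothing reaches the recognizer until several frames after the speaker has finished. A streaming model doesn't have to pick between those two failure modes.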

And to your points about echo cancellation and interference: there are many text-only operations that benefit from being able to start early in the audio stream, not late.

I just went through the process of helping someone stand up an interactive system with Whisper etc., and the lack of an open-source, Whisper-quality streaming system is such a bummer, because it really is so much laggier than it has to be.