by refulgentis
698 days ago
This is fascinating. If your hint in another comment means you worked on this at Google, it's entirely possible I have this all wrong because I'm missing the actual ML part: I wrote the client encoder & server decoder for Opus and the client-side UI for the SODA launch, and I'm honestly really surprised to hear Google has different stuff. The client-side code loop AGSA used is 100% replicated, in my experience, by using Whisper. I don't want to make too strong a claim given NDAs (in reality, my failing memory :P), but I'm 99% sure Whisper inference on-device is just as fast as SODA.

I don't know what to say because I'm flummoxed. It makes sense to me that Whisper isn't as good as SODA, and I don't want to start banging the table that it's no different from a user or client perspective; I don't think that's fair. There's a difference in model architecture and it matters. I think it's at least a couple of WER points behind.

But then where are the better STT solutions? Are all the obviously much better solutions really locked up? Picovoice is the only closed solution I know of that's available for local dev, and even per them, it's only better than the worst Whisper. The smallest Whisper model is 70 MB in ONNX vs. 130 MB for the next step up; both run inference fine with ~600 ms latency from audio byte at the mic to text on screen, ranging from WASM in a web browser to a 3-year-old Android phone.
|
I suspect that's what people are getting at by saying it's "not streaming": it's built as a batch process but, under some circumstances, you can run it fast enough to get away with pretending that it isn't.
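
To make the "batch pretending to be streaming" idea concrete, here is a rough sketch of what I mean, not anyone's actual implementation. Everything named here is hypothetical: `pseudo_stream` is a made-up helper, `transcribe` stands in for whatever batch model call you have, and the 16 kHz sample rate and 30-second window are assumptions (the window is chosen to match Whisper's fixed 30-second input).

```python
from typing import Callable, Iterable, Iterator, List

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio


def pseudo_stream(
    chunks: Iterable[List[float]],
    transcribe: Callable[[List[float]], str],  # hypothetical batch STT call
    window_seconds: float = 30.0,              # assumption: rolling context length
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[str]:
    """Re-run a batch transcriber on a rolling buffer after every mic chunk."""
    max_samples = int(window_seconds * sample_rate)
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Keep only the most recent window of audio.
        if len(buffer) > max_samples:
            del buffer[: len(buffer) - max_samples]
        # The model sees the whole buffer again each time: batch work,
        # repeated often enough that the UI looks like it is streaming.
        yield transcribe(list(buffer))
```

Each yielded string replaces the previous one on screen. It only feels live if one `transcribe` call finishes faster than the next chunk arrives, which is the "under some circumstances" part.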