by opprobium
704 days ago
It is not streaming in the way people normally use this term. It's a fuzzy notion, but typically streaming means something encompassing:
- Processing and emitting results at something closer to a word-by-word level
- Allowing partial results while the user is still speaking and mid-segment
- Not relying on an external segmenter to determine the chunking (and therefore also the latency) of the output.
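To make the distinction concrete, here's a toy sketch, not any real recognizer's API. The names `segmented`, `streaming`, `WORDS` and the 500 ms pause threshold are all made up for illustration. It assumes words arrive as (time in ms, word) pairs and compares when text first becomes visible under each approach.

```python
from typing import List, Tuple

# Hypothetical input: (arrival time in ms, decoded word) pairs standing in
# for what a recognizer would produce from live audio.
WORDS: List[Tuple[int, str]] = [
    (200, "turn"), (500, "on"), (800, "the"), (1100, "lights"), (2000, "please"),
]


def segmented(words: List[Tuple[int, str]], pause_ms: int = 500) -> List[Tuple[int, str]]:
    """Emit text only when an external segmenter has seen `pause_ms` of silence.

    Output latency is set by the segmenter, not by the model.
    """
    out: List[Tuple[int, str]] = []
    buf: List[str] = []
    prev_t = 0
    for t, w in words:
        if buf and t - prev_t >= pause_ms:
            # The segment closes once the pause has fully elapsed.
            out.append((prev_t + pause_ms, " ".join(buf)))
            buf = []
        buf.append(w)
        prev_t = t
    if buf:
        # End of input: the final segment still waits out the pause.
        out.append((prev_t + pause_ms, " ".join(buf)))
    return out


def streaming(words: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Emit a growing partial hypothesis as each word arrives, mid-segment."""
    out: List[Tuple[int, str]] = []
    buf: List[str] = []
    for t, w in words:
        buf.append(w)
        out.append((t, " ".join(buf)))
    return out


if __name__ == "__main__":
    print("segmented:", segmented(WORDS))
    print("streaming:", streaming(WORDS))
```

In this toy setup the segmented version shows nothing until 1600 ms, while the streaming version shows "turn" at 200 ms. A real streaming model would also revise earlier words as more audio arrives, which this sketch leaves out.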
|
I don't want to make overly strong claims given NDAs (in reality, my failing memory :P), but I'm 99% sure inference on-device is just as fast as SODA. I don't know what to say because I'm flummoxed. It makes sense to me that Whisper isn't as good as SODA, and I don't want to start banging the table that it's no different from a user or client perspective; I don't think that's fair. There's a difference in model architecture and it matters. I think it's at least a couple of WER points behind.
But then where are the better STT solutions? Are all the obviously much better solutions really locked up? Picovoice is the only closed solution I know of that's available for local dev, and even per them, it's only better than the worst Whisper. The smallest is 70 MB in ONNX vs. 130 MB for the next step up; both run inference fine with ~600 ms latency from audio byte at the mic to text on screen, on everything from WASM in a web browser to a 3-year-old Android phone.