by opprobium
703 days ago
Streaming for TTS doesn't matter, but for speech to text it is more meaningful in interactive cases. In that case the user's speech is arriving in real time, and streaming can mean a couple of levels of things:

- Overlap compute with the user speaking: Not having to wait until all the speech has been acquired can massively reduce latency at the end of speech and allow a larger model to be used. This doesn't have to be the whole system; for instance, an encoder can run in this fashion over the audio as it comes in even if the final step of the system then runs in a non-streaming fashion.
- Produce partial results while the user is speaking: This can be just a UI nice-to-have, but it can also be much deeper, e.g., a system can be activating on words or phrases in the input before the user is finished speaking, which can dramatically change latency.
- Better segmentation: Whisper + Silero is just using VAD to make segments for Whisper; this is not at all the best you can do if you are actually decoding while you go. Looking at the results as you go allows you to make much better and faster segmentation decisions.
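
To make the first point concrete, here is a rough Python sketch of an encoder that does its work as audio chunks arrive, so that only a small final step is left once the user stops speaking. Everything in it is an assumption for illustration: the `StreamingEncoder` class, the `finalize` function, the 160-sample frame size, and the per-frame energy "feature" are toy stand-ins, not how any particular speech stack works.

```python
from typing import Dict, Iterable, List

FRAME = 160  # assumed frame size: 10 ms of audio at 16 kHz


class StreamingEncoder:
    """Toy stand-in for an acoustic encoder (hypothetical, not a real model).

    It emits one feature per complete frame as soon as that frame is available,
    so the encoding cost is paid while the user is still speaking.
    """

    def __init__(self) -> None:
        self._pending: List[float] = []  # samples not yet forming a full frame
        self.features: List[float] = []  # one value per encoded frame

    def push(self, chunk: Iterable[float]) -> int:
        """Consume a chunk of samples and return how many new frames were encoded."""
        self._pending.extend(chunk)
        n_new = 0
        while len(self._pending) >= FRAME:
            frame = self._pending[:FRAME]
            del self._pending[:FRAME]
            # Mean energy is a placeholder for a real encoder's output.
            self.features.append(sum(s * s for s in frame) / FRAME)
            n_new += 1
        return n_new


def finalize(features: List[float]) -> Dict[str, float]:
    """Non-streaming final step, run once at end of speech (placeholder logic)."""
    mean = sum(features) / len(features) if features else 0.0
    return {"frames": len(features), "mean_energy": mean}


if __name__ == "__main__":
    enc = StreamingEncoder()
    for chunk in ([0.5] * 400, [0.5] * 80):  # audio arriving in uneven chunks
        enc.push(chunk)
    print(finalize(enc.features))
```

The point of the split is that `push` runs during speech and `finalize` is the only work left at end of speech, which is where the latency saving would come from.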
Until these, you'd use echo cancellation to try to allow interruptible dialogue, and that's unsolved: you need a consistently cooperative chipset vendor for that (read: it wasn't possible even at scale, with carrots, presumably sticks, and much cajoling. So it works on iPhones consistently.)
On every stack I've seen that is described as streaming, the partial results are obtained by running inference on the entire audio so far, and silence is determined by VAD.
I find it hard to believe that Google and Apple specifically, and every other audio stack I've seen, are choosing to do "not the best they can at all".
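
For reference, the pattern described here (re-run inference on all the audio so far for partials, let VAD decide when a segment ends) can be sketched roughly as below. This is only an illustration under assumptions: `transcribe` is a hypothetical callable standing in for whatever model is used, `is_silent` is a crude energy threshold rather than a real VAD such as Silero, and the `hangover` count and threshold value are made up.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple


def is_silent(chunk: Sequence[float], threshold: float = 1e-3) -> bool:
    """Crude energy-based silence check (assumed stand-in for a real VAD)."""
    energy = sum(s * s for s in chunk) / max(len(chunk), 1)
    return energy < threshold


def pseudo_stream(
    chunks: Iterable[Sequence[float]],
    transcribe: Callable[[List[float]], str],  # hypothetical model call
    hangover: int = 3,  # assumed: silent chunks needed to close a segment
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) per chunk and ("final", text) at segment end."""
    buffer: List[float] = []
    has_speech = False
    silent_run = 0
    for chunk in chunks:
        silent = is_silent(chunk)
        if silent and not has_speech:
            continue  # skip silence before any speech has been seen
        buffer.extend(chunk)
        if silent:
            silent_run += 1
        else:
            silent_run = 0
            has_speech = True
        # Inference is repeated over the whole buffer, not just the new chunk.
        text = transcribe(buffer)
        if silent_run >= hangover:
            yield ("final", text)
            buffer, has_speech, silent_run = [], False, 0
        else:
            yield ("partial", text)


if __name__ == "__main__":
    speech, silence = [0.5] * 4, [0.0] * 4
    fake = lambda audio: f"{len(audio)} samples"  # placeholder for a model
    for kind, text in pseudo_stream([speech, speech, silence, silence, silence], fake):
        print(kind, text)
```

Note that the segment boundary here depends only on the VAD, never on the decoded text, and the cost of each partial grows with the buffer length.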