I think he’s saying they are taking on an insane level of complexity to shave ~100 ms off response times in a scenario where that isn’t important and might even be a negative.
When GP mentioned reducing conversational latency as a negative, that made sense (and should probably be done IMO); it just wasn't the same category of latency the article talks about reducing. I.e. increasing "network latency" just makes the conversation feel more and more out of sync; it doesn't change the rate at which the AI will interrupt ("turn latency"), because the latter is based on the duration of the pause in the audio stream, not the time it took to deliver that audio stream.
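To make the distinction concrete, here is a minimal sketch, not anyone's actual implementation. It assumes 20 ms audio frames, a per-frame voiced/unvoiced flag, and a fixed one-way network delay; `detect_end_of_turn`, `FRAME_MS`, and the parameter names are all made up for illustration. The point is only that the silence threshold decides how long a pause triggers a turn, while network delay just shifts when the server finds out.

```python
FRAME_MS = 20  # assumed frame duration, purely illustrative


def detect_end_of_turn(voiced, silence_threshold_ms, network_delay_ms):
    """Return (pause_ms, server_time_ms) when end-of-turn fires, or None.

    voiced: list of booleans, one per captured frame (True = speech).
    The pause is measured in capture time, so it does not depend on
    how long the frames took to arrive.
    """
    silence_ms = 0
    for i, is_speech in enumerate(voiced):
        # Reset the running silence counter whenever speech resumes.
        silence_ms = 0 if is_speech else silence_ms + FRAME_MS
        if silence_ms >= silence_threshold_ms:
            # The frame ends at (i + 1) * FRAME_MS in capture time and
            # reaches the server one network delay later.
            server_time_ms = (i + 1) * FRAME_MS + network_delay_ms
            return silence_ms, server_time_ms
    return None


# 1 s of speech followed by 1 s of silence.
voiced = [True] * 50 + [False] * 50

fast = detect_end_of_turn(voiced, silence_threshold_ms=500, network_delay_ms=20)
slow = detect_end_of_turn(voiced, silence_threshold_ms=500, network_delay_ms=120)
patient = detect_end_of_turn(voiced, silence_threshold_ms=800, network_delay_ms=20)

print(fast)     # (500, 1520)
print(slow)     # (500, 1620)
print(patient)  # (800, 1820)
```

In this toy model the 100 ms of extra network delay moves the detection time but leaves the required pause at 500 ms; only changing the threshold makes the AI wait longer before jumping in.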
If you meant there is a case where reducing the network latency at the same delivery reliability for a given audio stream is actually a negative then I'd love to hear more about it as I'm a network guy always in search of an excuse for latency :D.
> If you meant there is a case where reducing the network latency […] is actually a negative then I'd love to hear more about it
That is exactly what parent did:
> they are doing an insane level of complexity to shave off ~100 ms
The downside is everything they had to do to achieve it, and maintaining that capability going forward, when the product can tolerate much more. It is the definition of premature optimization.
It just may not be at a level where it is relevant to your argument/decision space in IT.
But you want to be able to interject “hold on…” and have it immediately stop talking when it goes off the rails.
And GP is correctly pointing out that the only negative here (silence waiting latency maybe being too low) is tunable separately from the network latency number.
I want to be able to click the "Stop" button on my earphones remote. I want to be able to interject "woah" or "stop!" or "wait!", or have it detect that I've inhaled a breath, or that my eyes have glazed over. I want the LLM to figure out that every speed setting for its voice output is in "auctioneer" territory rather than "lecturing university professor with tenure and a pension" pacing.
But we won't get any of that, because the prime directive of LLMs is to burn tokens like there's no tomorrow. Burn tokens on a naïve answer without asking clarifying questions. Burn tokens on writing, debugging, and running a Python script or accessing and parsing 10 websites without asking for consent. Burn tokens on half-baked images with misspellings and 31 fingers. Burn tokens arguing "how many 'r's in strawberry?". Burn tokens asking a followup question at the end of every single answer, begging the user to re-engage and burn more tokens.
There is a little red "Stop" control when text output is being produced, at least, but does "Stop" halt everything and throw away the context? Re-prompt from the beginning?
The "maximize tokens burnt" prime directive is not to be found in any system prompt or user documentation. It is seemingly a common feature of the training for any consumer model.
Currently, if I'm using voice for an LLM, I use the voice dictation in the keyboard feature, because then the response is in text. There is no way to prevent "responding in kind" if I query the thing with audio. Or in Swahili.
You actually don’t want it to immediately stop, because people say things like “hm” or “yeah” during machine output. Maybe you say “no” to someone next to you and don’t want to interrupt the output.
To confidently interrupt, I would want to assert that the user has been speaking for longer than some threshold N. You could do other things, like parse a streaming transcription for keywords, but generally it feels like bad UX to me to cut output the second input is detected. Letting the user talk for 1-2s gives a much stronger signal, and it isn’t too weird for someone to keep talking for 1.5s after you start.
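A rough sketch of that "speaking for longer than N" check, under assumptions of my own: 20 ms frames, a voice-activity flag per frame, and a small tolerance for gaps between words. `interrupt_frame`, `min_speech_ms`, and `max_gap_ms` are hypothetical names, not from any real voice stack.

```python
FRAME_MS = 20  # assumed frame duration, purely illustrative


def interrupt_frame(voiced, min_speech_ms=1500, max_gap_ms=200):
    """Return the frame index at which to cut the output, or None.

    voiced: list of booleans from some voice-activity detector
    (True = the user is speaking in that frame).
    """
    speech_ms = 0
    gap_ms = 0
    for i, is_speech in enumerate(voiced):
        if is_speech:
            speech_ms += FRAME_MS
            gap_ms = 0
        else:
            gap_ms += FRAME_MS
            if gap_ms > max_gap_ms:
                # The utterance ended before reaching the threshold,
                # so treat it as a backchannel ("hm", "yeah") and reset.
                speech_ms = 0
        if speech_ms >= min_speech_ms:
            return i
    return None


# A 300 ms "hm" followed by silence: no interruption.
backchannel = [True] * 15 + [False] * 100
# 2 s of continuous speech: interrupt once 1.5 s has accumulated.
real_interjection = [True] * 100

print(interrupt_frame(backchannel))        # None
print(interrupt_frame(real_interjection))  # 74
```

The gap tolerance is there so a brief pause between words doesn't reset the count; both numbers are knobs you would tune, and none of this depends on network latency.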