Hacker News new | ask | show | jobs
by kixelated 38 days ago
HELLO MR SEAN,

1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency vs quality, but WebRTC makes that knob difficult to turn.

2. I'm obviously not an TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?

3. Yeah, sometimes the client is aware when their IP changes and can do an ICE renegotiation. But often they aren't aware, and normally would rely on the server detecting the change, but that's not possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.

4. Okay, that draft means 7 RTTs instead of 8 RTTs? Again some can be pipelined so the real number is a bit lower. But like the real issue is the mandatory signaling server which causes a double TLS handshake just in case P2P is being used.

5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.

I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.

The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.

4 comments

Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.
This is how Nova Sonic works. Having done some implementations it’s trickier than you might like (e.g. the Python library for Sonic had problems with echoes and we had to use the Java library)
You ultimately still need a jitter buffer large enough to absorb retransmisiones. Otherwise you’ve got stuttering audio. And dynamically adjusting this jitter buffer is hard
I'm not an expert. Can't we abuse that LLMs don't need to receive audio as a continuous stream without interruptions? Couldn't we just send data and pipe it into the LLM with deduplication (if resending happens)?

  x...y...y[dedup]...z
You’re absolutely correct. A jitter buffer is necessary for a human listener, but a LLM isn’t aware of a time lapse, just like it isn’t aware of the time since your last message in the conversion (unless the chat harness explicitly informs it).
Audio -> ASR - no jitter buffer TTS -> human - jitter buffer
> And dynamically adjusting this jitter buffer is hard

Unappreciated part of this entire conversation.

1.) Latency vs quality doesn't come up enough to make people want to A/B test it unfortunately. At work I would say ~5 people care about WebRTC vs QUIC vs X. All effort is around the models (how can I provide tools to be support those doing that work)

2.) The model isn't processing just text anymore. Also taking into account breathing/emotion etc... not just spitting out big responses anymore. As it generates them it is taking into account the users response.

3.) It works with the LB setup today. Clients are sending ICE traffic, if it roams we lookup the ufrag and route appropriately.

4.) With DTLS 1.3 it is 1 RTT with SNAP[0] for WebRTC session. SCTP info goes in Offer/Answer, DTLS is packed into ICE. You are totally right about signaling though! [1] was my answer for doing WebRTC without signaling, couldn't get anyone to care though.

5.) I don't have anything that I need to tune. If I want to increase (or decrease) latency [3] is something I put into Transceiver. Otherwise I can't think of any 'change this WebRTC behavior' that has been asked by users/developers.

[0] https://datatracker.ietf.org/doc/draft-hancke-tsvwg-snap/

[1] https://github.com/pion/offline-browser-communication

[3] https://webrtc.googlesource.com/src/+/refs/heads/main/docs/n...

Human spoken conversation doesn’t really work like file buffering.

People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.

But pauses and stalls are much more damaging. A sudden freeze in the middle of speech breaks turn-taking, timing, and attention. It feels like the speaker stopped thinking, the connection died, or the system got stuck.

For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.

I think this is mixing domains quite a bit;

If I'm talking to a friend or peer and I'm on a crappy link, we can probably work it out. If I'm calling my lawyer from prison with my "one call" I really want my lawyer to get my instructions clearly and correctly, ideally the first time without a lot of coaching.

Where on this scale does "person talking to LLM" fit?

I believe there's a ton of research into the shannon limit and human speech. You can trivially observe how much redundancy there is by listening to a podcast at 1x, 1.2x, 1.5x, 2x, etc, and when you can't follow what's going on, you've found the "redundancy" built into that language. This number falls way off when you're listening to a person with an accent or when the recording is noisy or whatever.

You'll also find that your tolerance for lossy media is radically different based on latency and echos and jitter in the audio (which I believe is the point of the original "don't use webrtc" article...)

Finally, people may tolerate this, but the "phonem to token" thinger may be less tolerant, and will certainly not be able to magic correct meaning from lost packets, and if the resulting exchange is extremely expensive or important (from the lawyer and the "I'm in jail in poughkeepsie; I need bail!" exchange) you really want to take the time to get it right, not make things guess.

> People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.

LLMs are surprisingly good at this, too.

This entire blog post is based on assumptions that

1) WebRTC garbling is common

2) LLMs fall apart if there are any audio glitches

I would bet money that OpenAI explored and has statistics on both of those and how it impacts service. More than this blogger heaping snark upon snark to avoid having a realistic conversation about pros and cons

The misunderstanding the user comes down to understanding how the user prompts and what type of responses the user gets in return. I’m wondering if for anything code the llm could have an interrupter that it would first read what the user wrote and translate it to proper sentence structure and the return a truer value. I think the llm is having an understanding issue because everyone has a unique signature in how they explain something. That signature operates like a personal language of the user as to which most of us will run through different scenarios to come up with a conclusion from our personal signature/language in which we conduct ourselves. And since llms are gamed to get to the answer faster using less tokens it probably picks the average high level signature that can be used for multiple users.