Hacker News
by thutch76 37 days ago
I didn't make it all the way through the post, but I have to say I think he fundamentally misunderstands the purpose of WebRTC. He calls himself an expert, and yeah, he's written SFUs in Go and Rust at different companies ... but his technical credentials do not mean he's correct.

Maybe it's a comprehension issue on my end, but he seems to treat things like STUN and DTLS as related, compounding issues (particularly for round-trip time), when they are really orthogonal.

Also, he spends too much time talking about how you can't resend packets, and reiterates that point by stating they tried really hard (at Discord?). That's where he lost the plot, imo.

The RTC in WebRTC is about real time communication. Humans will naturally prefer the auditory experience of an occasional dropped packet, vs backed up audio or audio that plays at an uneven rate. To clarify, I'm talking about human speech here.

If you want to avoid packet loss entirely, use a protocol based on TCP instead of UDP. But you know what happens when you send audio over poor network conditions with TCP? There will be pauses on the receiving end as it waits for the next correct packet. Let's say the delay is multiple seconds. What should the receiving end do when packets start flowing again? Play the backed-up audio at its natural rate? Attempt to play the audio back at a higher rate to "catch up" with any other channels? People, humans, do not generally prefer that experience.
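
To make the stall concrete, here is a rough toy model, not a real network stack. It assumes 20 ms audio frames, a fixed one-way latency, and a single lost frame that shows up late after a resend. The function names and all the numbers are made up for illustration. It compares in-order delivery played at the natural rate (TCP-like) with a fixed playout deadline that drops late frames (UDP-like).

```python
FRAME_MS = 20  # assumed frame duration


def arrival_times(n_frames, latency_ms, lost, retransmit_ms):
    """Arrival time of each frame; frames in `lost` arrive late after a resend."""
    return [
        i * FRAME_MS + latency_ms + (retransmit_ms if i in lost else 0)
        for i in range(n_frames)
    ]


def ordered_playout(arrivals):
    """TCP-like: frames are delivered in order and played at the natural rate."""
    play = []
    deliver = 0
    for t in arrivals:
        # Head-of-line blocking: a frame cannot be delivered before its predecessors.
        deliver = max(deliver, t)
        # Natural rate: a frame cannot start until the previous one has finished.
        start = deliver if not play else max(deliver, play[-1] + FRAME_MS)
        play.append(start)
    return play


def deadline_playout(arrivals, latency_ms, buffer_ms):
    """UDP-like: each frame has a fixed deadline; late frames are dropped (None)."""
    out = []
    for i, t in enumerate(arrivals):
        deadline = i * FRAME_MS + latency_ms + buffer_ms
        out.append(deadline if t <= deadline else None)
    return out


arrivals = arrival_times(10, latency_ms=50, lost={3}, retransmit_ms=200)
tcp_like = ordered_playout(arrivals)
udp_like = deadline_playout(arrivals, latency_ms=50, buffer_ms=40)
```

In this model the in-order stream never recovers by itself: every frame after the lost one plays 200 ms later than it arrived. The deadline stream loses exactly one 20 ms frame and stays on schedule.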

Forget about WebRTC for a minute and instead think about TCP vs UDP for voice. VoIP has been based on UDP since the '90s for a reason.

2 comments

I think you're not really engaging with his point, which is that RTC is a poor fit for communicating with an AI agent. I didn't read the blog as claiming that WebRTC is bad for what it is, only that it's a (very) poor choice for a voice-to-AI application.
That's fair. My attention waned and I lost the plot.

However, I don't think having an agent on one side necessarily changes anything. Network problems are not predictable, particularly on mobile, so the human is still very likely to have a poor auditory experience on a TCP connection.

The difference is that the agent doesn’t run in realtime. If 20 packets are lost and resent, the agent can still process them almost instantly and reply, in contrast to a human. Only the direction from the agent to the human needs to be realtime.
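
A minimal sketch of that idea, assuming frames carry a sequence number starting at 0 (the `ReorderBuffer` class is hypothetical, not taken from any library): the receiver holds frames that arrive ahead of a gap and releases the whole run as soon as the resent frame fills it, so the agent gets the backlog in one batch.

```python
class ReorderBuffer:
    """Hold out-of-order frames and release them in sequence order."""

    def __init__(self):
        self.next_seq = 0   # next sequence number the consumer expects
        self.pending = {}   # frames that arrived ahead of a gap

    def push(self, seq, frame):
        """Store a frame; return every frame that is now deliverable in order."""
        if seq >= self.next_seq:  # ignore duplicates of already delivered frames
            self.pending[seq] = frame
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
first = buf.push(0, "a")     # delivered immediately
held = buf.push(2, "c")      # held: frame 1 is missing
held2 = buf.push(3, "d")     # still held
batch = buf.push(1, "b")     # resend arrives, the backlog is released at once
```

The agent side only has to consume `batch` as fast as its compute allows; nothing is skipped.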
Only if you expect to interact with the agent in a turn-taking format, with (possible) pauses between every turn.

ChatGPT’s voice mode is like speaking to someone in real time on a voice call, not input -> output.

> Humans will naturally prefer the auditory experience of an occasional dropped packet, vs backed up audio or audio that plays at an uneven rate

Yes, but the difference here is that there is only one human in the conversation. The other side can tolerate a 200ms delay in receiving or sending perfectly fine because it is not constrained to run in exactly real time like a human brain is.

I think he is right. This is an interesting point that I haven't considered before. The reason we skip 200ms instead of pausing for 200ms when we get missed packets in a WebRTC call is because we can't pause the human on the other side of the call. But we can pause AI just fine.

> The reason we skip 200ms instead of pausing for 200ms when we get missed packets in a WebRTC call is because we can't pause the human on the other side of the call. But we can pause AI just fine.

This isn't about pausing anyone; it's about doing faster-than-realtime processing after a delay event. Humans can do that to some extent, and this is in fact done with some voice applications like Microsoft Teams, where after a network interruption the audio is sometimes played back really fast until the point that it becomes real-time again.

I hope it's an intentional design decision, because it works really well (for me). I can often perfectly keep track of a conversation in spite of the network delay. As much as I hate Teams, its meetings and voice implementation (also noise cancellation) work quite well, especially compared to current open source solutions like Jitsi or BigBlueButton.

Yes, it's about pausing. You pause the AI so it doesn't need to perceive the 200ms gap at all, unlike a human who will always perceive the interruption. Yes, then you run faster than real time to catch up.

Yes, humans can listen to audio faster than real time to catch up, but it degrades the experience and there is a fairly low limit to it. When talking to an AI you don't have to skip or speed up at all on the human side, is the point.
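
Back-of-the-envelope version of the catch-up argument, as a sketch: while the backlog is being consumed at `speed` times real time, new audio keeps arriving at 1x, so the backlog shrinks at `speed - 1`. The 1.25x and 51x figures below are assumptions picked for illustration, not measurements of Teams or of any model.

```python
def catch_up_ms(backlog_ms, speed):
    """Wall-clock time to drain a backlog while new audio keeps arriving at 1x."""
    if speed <= 1.0:
        raise ValueError("speed must exceed 1.0 to ever catch up")
    return backlog_ms / (speed - 1.0)


human_ms = catch_up_ms(200, 1.25)  # listener hears sped-up audio for this long
agent_ms = catch_up_ms(200, 51.0)  # assumed faster-than-realtime consumer
```

With these assumed speeds, a 200 ms gap costs the listener 800 ms of sped-up audio, while the faster consumer is back in sync after 4 ms.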

Yeah that's a really good way of framing the argument, I wish I wrote that. The way robots listen/respond is bounded by compute, not time. Buffering audio isn't a great experience for humans but definitely works for robots.
i haven't used the openai voice thing

but, if it's trying to respond in a natural way, with interruptions in both directions, it may still be a good idea. if there's a delay between you stopping and it starting talking, it feels weird

(you might be able to fake some of that on the client, but then you need a thicker client)

Which LLM can generate text so quickly a real-time conversation is viable?
There are now realtime “speech-to-speech” models [0]. I believe they skip text to streamline the architecture.

[0]: https://openai.com/index/introducing-gpt-realtime/