by modeless 38 days ago
> Humans will naturally prefer the auditory experience of an occasional dropped packet, vs backed up audio or audio that plays at an uneven rate

Yes, but the difference here is that there is only one human in the conversation. The other side can tolerate a 200ms delay in receiving or sending perfectly fine, because it is not constrained to run in exactly real time like a human brain is.

I think he is right. This is an interesting point that I hadn't considered before. The reason we skip 200ms instead of pausing for 200ms when we get missed packets in a WebRTC call is because we can't pause the human on the other side of the call. But we can pause AI just fine.


> The reason we skip 200ms instead of pausing for 200ms when we get missed packets in a WebRTC call is because we can't pause the human on the other side of the call. But we can pause AI just fine.

This isn't about pausing anyone; it's about doing faster-than-realtime processing after a delay event. Humans can do that to some extent, and this is in fact done in some voice applications like Microsoft Teams, where after a network interruption the audio is sometimes played back faster than real time until it has caught up with the live stream again.

I hope it's an intentional design decision, because it works really well (for me). I can often keep track of a conversation perfectly in spite of the network delay. As much as I hate Teams, its meetings and voice implementation (also noise cancellation) work quite well, especially compared to current open source solutions like Jitsi or BigBlueButton.
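
For a rough sense of how long that kind of catch-up takes: if playback is d seconds behind and runs at s times real time, the backlog shrinks by (s - 1) seconds per second of wall-clock time, so catching up takes d / (s - 1). A minimal Python sketch of that arithmetic follows. The function name and the 1.25x speed are illustrative assumptions, not anything Teams documents.

```python
def catch_up_seconds(backlog_s: float, speed: float) -> float:
    """Wall-clock time needed to drain a playback backlog.

    While catching up, `speed` seconds of audio are played per wall-clock
    second, but 1 second of new audio also arrives in that time, so the
    backlog only shrinks at (speed - 1) seconds per second.
    """
    if speed <= 1.0:
        raise ValueError("speed must be > 1.0 to ever catch up")
    return backlog_s / (speed - 1.0)


# Hypothetical numbers: a 200ms backlog played back at 1.25x.
print(catch_up_seconds(0.2, 1.25))
```

With those assumed numbers, a 200ms gap takes roughly 0.8 seconds of sped-up playback to absorb, which is why modest speed-ups can go mostly unnoticed.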

Yes, it's about pausing. You pause the AI so it doesn't need to perceive the 200ms gap at all, unlike a human who will always perceive the interruption. Yes, then you run faster than real time to catch up.

Yes, humans can listen to audio faster than real time to catch up, but it degrades the experience and there is a fairly low limit to it. The point is that when talking to an AI, you don't have to skip or speed up at all on the human side.
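
To make the two policies concrete, here is a toy Python simulation. Everything in it is an assumption for illustration: 20ms frames, a 10ms network latency, a 40ms playout delay, a 5ms per-frame processing cost, and a network stall from 100ms to 300ms. It is not how WebRTC or any particular voice model is implemented, just a sketch of "skip late audio" versus "wait, then catch up".

```python
from dataclasses import dataclass

FRAME_MS = 20  # assumed audio frame duration


@dataclass
class Frame:
    seq: int
    arrival_ms: int  # when the frame reaches the receiver


def make_frames(n, latency_ms, stall_start_ms, stall_end_ms):
    """Frames are sent every FRAME_MS; frames sent during the stall
    are all delivered in a burst when the stall ends."""
    frames = []
    for seq in range(n):
        sent = seq * FRAME_MS
        if stall_start_ms <= sent < stall_end_ms:
            arrival = stall_end_ms + latency_ms
        else:
            arrival = sent + latency_ms
        frames.append(Frame(seq, arrival))
    return frames


def realtime_playout(frames, playout_delay_ms=40):
    """Human-style receiver: every frame has a fixed playout deadline,
    and frames that miss it are dropped (the listener hears a gap)."""
    played, dropped = [], []
    for f in frames:
        deadline = f.seq * FRAME_MS + playout_delay_ms
        (played if f.arrival_ms <= deadline else dropped).append(f.seq)
    return played, dropped


def buffered_consume(frames, process_ms_per_frame=5):
    """Machine-style receiver: waits for every frame and processes them
    in order, faster than real time. Returns each frame's finish time."""
    clock = 0
    finish_times = []
    for f in sorted(frames, key=lambda fr: fr.seq):
        clock = max(clock, f.arrival_ms) + process_ms_per_frame
        finish_times.append(clock)
    return finish_times


frames = make_frames(n=20, latency_ms=10, stall_start_ms=100, stall_end_ms=300)
played, dropped = realtime_playout(frames)
finish = buffered_consume(frames)
print("real-time receiver dropped frames:", dropped)
print("buffered receiver lag on last frame (ms):", finish[-1] - frames[-1].arrival_ms)
```

Under these assumptions the real-time receiver loses most of the stalled audio, while the buffered receiver sees every frame and is back to only its processing delay by the end of the stream.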

Yeah, that's a really good way of framing the argument; I wish I had written that. The way robots listen and respond is bounded by compute, not time. Buffering audio isn't a great experience for humans but definitely works for robots.

i haven't used the openai voice thing

but, if it's trying to respond in a natural way, with interruptions in both directions, it may still be a good idea. if there's a delay between you stopping and it starting talking, it feels weird

(you might be able to fake some of that on the client, but then you need a thicker client)

Which LLM can generate text so quickly a real-time conversation is viable?

There are now realtime “speech-to-speech” models [0]. I believe they skip text to streamline the architecture.

[0]: https://openai.com/index/introducing-gpt-realtime/