by cossatot 2193 days ago
I think a larger issue is the focus on video as opposed to audio. Audio may be less sexy but it is far and away more important for most interpersonal communication (I'm not discussing gaming or streaming or whatever, but teleconferencing). Most of us don't care that much if we get super crisp, uninterrupted views of our colleagues or clients, but audio problems really impede discussion.

Video is related to this though. If audio is synced to the video then a delayed video stream also means a delayed audio stream.
In my approach, these would be 2 completely independent streams. I haven't implemented audio yet, but hypothetically you can continuously adjust the sample buffer size of the audio stream to be within some safety margin of detected peak latency, and things should self-synchronize pretty well.
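To make the buffer-sizing idea concrete, here is a minimal sketch, assuming latency is measured per packet in milliseconds. The class name `AdaptiveBufferTarget`, the 50-measurement window, and the 1.25 safety margin are hypothetical choices for illustration, not taken from any existing implementation:

```python
import math
from collections import deque


class AdaptiveBufferTarget:
    """Suggests a playback buffer size from recently observed latency (hypothetical helper)."""

    def __init__(self, sample_rate=44100, window=50, safety_margin=1.25):
        self.sample_rate = sample_rate
        self.safety_margin = safety_margin
        # Only the most recent `window` measurements are kept, so the
        # target can shrink again once a latency spike has aged out.
        self.latencies_ms = deque(maxlen=window)

    def observe(self, latency_ms):
        """Record one latency measurement in milliseconds."""
        self.latencies_ms.append(latency_ms)

    def target_samples(self):
        """Samples per channel to keep buffered: recent peak latency times the margin."""
        if not self.latencies_ms:
            return 0
        peak_ms = max(self.latencies_ms)
        return math.ceil(peak_ms * self.safety_margin * self.sample_rate / 1000)
```

A receiver could call `observe()` for every packet and periodically resize its playback buffer toward `target_samples()`. How quickly it moves toward that target is a separate design choice.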

In terms of encoding the audio, I don't know that I would. For video, going from MPEG->JPEG brought the perfect trade-off. For reducing audio latency, I think you would just need to be sending raw PCM samples as soon as you generate them. Maybe in really small batches (in case you have a client super-close to the server and you want virtually 0 latency). If you use small batches of samples you could probably start thinking about MP3, but raw 44.1 kHz 16-bit stereo audio is only about 1.41 Mbps. Most cellphones wouldn't have a problem with that these days.
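The raw PCM arithmetic is easy to check. A small sketch (the function names and the 10 ms batch size are illustrative assumptions):

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


def batch_bytes(sample_rate_hz, bits_per_sample, channels, batch_ms):
    """Payload size in bytes of one batch covering batch_ms of audio."""
    frames = sample_rate_hz * batch_ms // 1000  # one frame = one sample per channel
    return frames * channels * bits_per_sample // 8


bitrate = pcm_bitrate_bps(44100, 16, 2)   # 1,411,200 bits per second
payload = batch_bytes(44100, 16, 2, 10)   # 1,764 bytes per 10 ms batch
```

So 44.1 kHz 16-bit stereo comes to 1,411,200 bits per second, and a 10 ms batch is 1,764 bytes before any packet headers.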

Edit: The fundamental difference in information theory regarding video and audio is the dimensionality. JPEG makes sense for video, because the smallest useful unit of presentation is the individual video frame. For audio, the smallest useful unit of presentation is the PCM sample, but the hazard is that these are fed in at a substantially higher rate (44k/s) than with video (60/s), so you need to buffer out enough samples to cover the latency rift.
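To put rough numbers on that rate difference, here is a tiny sketch. The 100 ms figure is just an assumed example latency and `units_in_window` is a made-up helper:

```python
def units_in_window(rate_per_second, window_ms):
    """How many presentation units (samples or frames) fall inside a latency window."""
    return rate_per_second * window_ms // 1000


audio_samples = units_in_window(44100, 100)  # PCM samples per channel in 100 ms
video_frames = units_in_window(60, 100)      # video frames in 100 ms
```

Under those assumptions the same 100 ms window holds 4,410 audio samples per channel but only 6 video frames.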

Discord does something like what you describe. It's kind of awful for music (e.g. if it's a channel with a music bot), as you'll hear it speed up and slow down in an oscillating pattern. The same effect also appears in games if you have a game loop that always tries to catch up to an ideal framerate by issuing more updates to match an average. The resulting oscillation, as the game suddenly slows down and then jerks forward, is hugely disruptive, so it's not really done this way in practice.

Oscillations are the main issue with "catch-ups" in synchronization, and dropping frames once your buffer is too far behind is often a more pleasant artifact. It's not really a one-size-fits-all engineering problem.
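One way to picture the drop-instead-of-catch-up policy is the minimal sketch below. The class name `DroppingBuffer` and both depth thresholds are hypothetical, and real clients will differ in how they choose what to drop:

```python
from collections import deque


class DroppingBuffer:
    """Playback queue that discards stale items rather than speeding up (hypothetical helper)."""

    def __init__(self, target_depth, max_depth):
        if max_depth < target_depth:
            raise ValueError("max_depth must be >= target_depth")
        self.target_depth = target_depth
        self.max_depth = max_depth
        self.items = deque()
        self.dropped = 0

    def push(self, item):
        """Queue a newly received frame or audio chunk."""
        self.items.append(item)
        # Too far behind: jump forward by discarding the oldest items in one
        # step, instead of playing faster to work through the backlog.
        if len(self.items) > self.max_depth:
            while len(self.items) > self.target_depth:
                self.items.popleft()
                self.dropped += 1

    def pop(self):
        """Return the next item to play, or None if the buffer is empty."""
        return self.items.popleft() if self.items else None
```

Playback speed never changes here. The listener gets a single skip when the backlog is cut, which is the trade-off against the oscillation described above.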

Audio conferencing at low latency is already solved by things like Mumble (https://www.mumble.info/). I think adding a video feed in complete parallel (as in, use Mumble as-is, do the video in another process) with no regard for latency would be a pretty good first step to see what can be achieved.

Early versions of YouTube nailed this. The video would frequently pause, degrade, or glitch due to buffering delays, but the audio would continue to play. This made all the difference in user perception: YouTube felt smooth. Other streaming services would pause both video and audio, which did not feel smooth at all. Maybe they had some QoS code in their webapp to prioritize audio?