Hacker News new | ask | show | jobs
by bob1029 2188 days ago
In my approach, these would be 2 completely independent streams. I haven't implemented audio yet, but hypothetically you can continuously adjust the sample buffer size of the audio stream to be within some safety margin of detected peak latency, and things should self-synchronize pretty well.

In terms of encoding the audio, I don't know that I would. For video, going from MPEG->JPEG brought the perfect trade-off. For reducing audio latency, I think you would just need to be sending raw PCM samples as soon as you generate them. Maybe in really small batches (in case you have a client super-close to the server and you want virtually 0 latency). If you use small batches of samples you could probably start thinking about MP3, but raw 44.1KHz 16-bit stereo audio is only 1.44 mbps. Most cellphones wouldn't have a problem with that these days.

Edit: The fundamental difference in information theory regarding video and audio is the dimensionality. JPEG makes sense for video, because the smallest useful unit of presentation is the individual video frame. For audio, the smallest useful unit of presentation is the PCM sample, but the hazard is that these are fed in at a substantially higher rate (44k/s) than with video (60/s), so you need to buffer out enough samples to cover the latency rift.

2 comments

Discord does something like what you describe. It's kind of awful for music(e.g. if it's a channel with a music bot) as you'll hear it speed up and slow down in an oscillating pattern. The same effect also appears in games if you should have a game loop that always tries to catch up to an ideal framerate by issuing more updates to match an average - the resulting oscillation as the game suddenly slows down and then jerks forward is hugely disruptive, so it's not really done this way in practice.

Oscillations are the main issue with "catch-ups" in synchronization, and dropping frames once your buffer is too far behind is often a more pleasant artifact. It's not really a one-size-fits-all engineering problem.

Audio conferencing at low latency is already solved by things like Mumble (https://www.mumble.info/). I think adding a video feed in complete parallel (as in, use mumble as-is, do the video in another process) with no regard for latency would be a pretty good first step to see what can be achieved.