(Speaker in that video here)

You're right that this is measuring the WebRTC.org codebase (as used in Chrome, Firefox, etc.), not necessarily the WebRTC protocol. Better URL and demo/talk videos are here: https://snr.stanford.edu/salsify

But the issue probably isn't the bandwidth estimator or congestion-control algorithm -- you probably can't fix this by taking some WebRTC implementation and plugging in better ones. The core issues, as we see them, are architectural: they lie in the design of the WebRTC.org codebase and, frankly, of every WebRTC source/sink implementation we're aware of. In particular:

(a) even with perfect bandwidth estimation, libvpx and libx264 (and, we think, typical hardware encoders) as configured are bad at achieving the requested bitrate over short timescales, so "overlarge" coded frames are regularly sent. Techniques like reference invalidation or golden/altref encoding are never (?) used to skip sending an overlarge coded frame, or to retry encoding the same frame at a lower quality before sending. [At a technical level, the interface between the encoder's VBV buffer model and the bandwidth estimator/CC algorithm is very inconvenient: having these two control loops run independently, trying to do similar things at similar timescales, isn't great.]

(b) loss recovery does not work well, and again features like reference invalidation to recover quickly do not seem to be well-used in practice (the second half of https://youtu.be/jaDelb4JnP4 makes this pretty clear),

(c) the WebRTC.org codebase is so complex, with so many modes that it can get settled in, that trying to reason about these behaviors or explore them systematically is quite challenging, and

(d) because there are several layers of buffering on the receiver-side, and because the sender-side code will change things like the camera's frame rate, it's hard to measure application-level metrics [e.g. lens-to-display and microphone-to-speaker latency] robustly in a deployment, especially across a diverse hardware or OS base. (And it's easy to get a false sense of security from the network-level WebRTC metrics that are available.)
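To make point (a) concrete, here is a toy sketch of the "retry at lower quality, or skip" idea: given a per-frame byte budget from the transport, try the frame at decreasing quality and send the first version that fits, otherwise send nothing. Everything here is hypothetical and for illustration only (`choose_frame`, `toy_encode`, the quality scale, the budgets). It is not the WebRTC.org API or any real encoder interface, and a real encoder would also have to keep its reference state consistent when a frame is re-encoded or skipped.

```python
from typing import Callable, Optional, Sequence, Tuple


def choose_frame(
    encode: Callable[[bytes, int], bytes],
    raw_frame: bytes,
    qualities: Sequence[int],
    budget_bytes: int,
) -> Optional[Tuple[int, bytes]]:
    """Return (quality, coded_frame) for the highest quality that fits
    within budget_bytes, or None if every version is too large (skip)."""
    for quality in sorted(qualities, reverse=True):
        coded = encode(raw_frame, quality)
        if len(coded) <= budget_bytes:
            return quality, coded
    return None  # skip this frame rather than send an overlarge one


def toy_encode(raw_frame: bytes, quality: int) -> bytes:
    """Stand-in for a real encoder (assumption): the output size simply
    scales linearly with a 1..100 quality setting."""
    size = max(1, len(raw_frame) * quality // 100)
    return raw_frame[:size]


if __name__ == "__main__":
    frame = bytes(10_000)  # placeholder "raw frame"
    for budget in (6_000, 3_000, 500):
        result = choose_frame(toy_encode, frame, (75, 50, 25), budget)
        if result is None:
            print(f"budget={budget}: skip frame")
        else:
            quality, coded = result
            print(f"budget={budget}: send quality={quality}, {len(coded)} bytes")
```

The point of the sketch is that a single loop makes the size decision per frame, using the transport's budget directly, instead of an encoder rate controller and a congestion controller each chasing the target separately.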

It's probably possible to produce a WebRTC source/sink implementation that works better over challenged networks and has good monitoring of application-level latency, but it would be a big job afaik. Our work was partly funded by Google, we had a high-up Google sponsor, we gave multiple talks at Google, etc., but it was challenging even to find "the people in charge" of this cross-modular stuff to talk with, because I think the codebase in some respects mirrors the org chart. E.g. you have video compression people worrying about the encoder (and wanting to be able to plug libvpx, libx264, and a bunch of hardware encoders into the same interface), and networking people worrying about the bandwidth estimator and CC algorithm. By now it's sort of way too late to say that the interfaces or architecture need to be refactored, or that the complexity has gotten out of control. To Google's credit, they have since driven the industry to produce standardized APIs for "functional" codecs, and functional decoder ASICs now exist (not sure about encoders yet), so there is progress being made on that front at least.