by Orphis 19 days ago
One of the interesting uses of AV1 was specifically for low bitrate calls, and software encoding was perfectly fine, even on mobile.

With low enough resolution, framerate and bitrate, you can get a quality stream without significant encoding artifacts compared to any other codec. It is in production right now and has been for a while.

The CPU / bandwidth tradeoff is quite advantageous in situations like this. And no, AV1 HW encoders usually cannot be used: they are not designed for tight bitrate control or realtime communications the way software encoders usually are.

2 comments

> One of the interesting usage of AV1 was specifically for low bitrate calls, and software encoding was perfectly fine, even on mobile.

You really want hardware decoding on mobile, otherwise you end up with 40 minutes of battery life. Fortunately, for typical videoconference resolutions, VP8 and H.264 are just fine. AV1 is nice to have, though, due to excellent support for synthetic content (screen sharing), and for scalable video coding (a much more elegant solution than simulcast, IMHO).

In the world I live in, the general plan is to stick to VP8 and H.264 for the time being, and to skip to AV1 when it's universally available on mobile. I haven't seen any features of AV2 which would justify waiting for it.

No, you do NOT want hardware anything on mobile if you are targeting lower bitrates that are not that taxing on the CPU, when the conditions are otherwise so bad that the call would either drop or be unusable. HW encoders produce bad results at low bitrate. HW decoders usually have issues with the temporal encodings used, and they may also just not accept those streams (a lot of the test scenarios are movies, and the RTC tools are poorly supported).

I worked on shipping it to Chromium, WebRTC and Google Meet many years ago and we had many publications about it:
- https://blog.google/products-and-platforms/products/duo/4-ne...
- https://webrtchacks.com/the-hidden-av1-gift-in-google-meet/

The use case is not screensharing or a large conference room, but mainly a simple talking face in a 1:1 chat with good quality, since packet loss is not as impactful on a 30KBps AV1 stream as on a 50KBps VP8 stream.

> we had many publications about it

I'd be interested in learning more, but the links you provide are just advertising copy. Could you please provide links to actual technical articles on your conclusions?

The internals are usually confidential and it's hard to find an engineer willing to make a comprehensive write-up about those: they want to make tech and not spend time proofing a tech write-up for public consumption (they already had to make an internal one!).

So the middle ground is that you have those "marketing" copies that demo the tech. One of the telling parts of those is how you can get a perfectly usable 30KBps stream with AV1 compared to a higher bitrate H264 stream that is unusable. What they don't tell you is that because you are using a lot fewer bytes, you are trading CPU power consumption for radio power consumption. It's a tricky comparison, but in general it's a favorable trade for the user who has very bad network conditions and is trying to make a call. The goal is to make the call work at all costs, not to save battery while transferring a useless stream of data.

> HW encoders produce bad results at low bitrate.

Is that poor implementation or is it inherently harder to implement in hw encoders?

There are a few reasons; I suspect fixed resource depth might be a factor in poor hardware single pass encoding ...

What does limit them, though, is pseudo real time single pass pipelines.

I see the best encoding results from two pass - one fast run to work out the easy compress and hard compress parts of a video and then a second pass to get the optimal results on a stream that's already got a budget in mind for each section through the advantage of foresight as to what's left to do.
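
To make the two pass idea concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the function names are made up, the complexity scores are invented numbers rather than anything measured from video, and real rate control is far more involved than a proportional split. It only shows why foresight helps: the second pass knows the total complexity before it hands out any bits, which a single pass real-time pipeline cannot know.

```python
def first_pass(frames):
    """Cheap analysis pass: return a complexity score per frame.

    The inputs here are plain numbers standing in for a complexity
    estimate; a real encoder would derive this from the video itself.
    """
    return [float(c) for c in frames]


def second_pass_budget(complexities, total_bits):
    """Split the total bit budget in proportion to each frame's complexity."""
    total_complexity = sum(complexities)
    return [total_bits * c / total_complexity for c in complexities]


# Hypothetical clip: two easy frames, one hard frame, one medium frame.
complexities = first_pass([1, 1, 6, 2])
budget = second_pass_budget(complexities, total_bits=10_000)
print(budget)  # [1000.0, 1000.0, 6000.0, 2000.0]
```

A single pass encoder sees the frames one at a time, so it has to guess how much to save for a hard section that may or may not be coming.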

As someone else said, it's poor single pass encoding performance targeted for the tools used in real-time communications. This type of usage is "new" to hardware manufacturers and they poorly test it as it's easier to make a chip good enough for decoding the general case for watching your favorite movie platform than do something comprehensive.

One aspect of real-time encoding is that the frames are not ordered or structured in the same linear way as they used to be in older formats. Now, we have temporal and spatial encoding, which allows for better frame drops or efficiency or a stream that is decodable at multiple resolutions at the same time.

An example of temporal encoding is that you have a sequence of frames at 15 fps (T0) that each reference the previous one, with an occasional I-frame, which is a full independent picture you can start decoding from. Then you can have another temporal layer (T1) where, for every frame in the base 15 fps layer (T0), you insert a new frame that depends on it. You end up with a 30 fps stream! And if your network connection gets worse, or your hardware can't keep up, you can drop the T1 layer and only use the T0 layer. This works great for real-time! In the specs, you could have more layers with more complex dependency chains, but 3 layers is as high as you want to go.
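
The T0/T1 pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, frames are represented by their index alone, and the strict even/odd alternation is the simplest possible two-layer structure, not any particular encoder's configuration.

```python
from typing import Optional


def temporal_layer(frame_index: int) -> int:
    """Even frames form the 15 fps base layer (T0); odd frames are T1."""
    return frame_index % 2


def reference_of(frame_index: int) -> Optional[int]:
    """Index of the frame this one predicts from (None for frame 0, treated as an I-frame)."""
    if frame_index == 0:
        return None
    if temporal_layer(frame_index) == 0:
        return frame_index - 2  # T0 references the previous T0 frame
    return frame_index - 1      # T1 references the T0 frame just before it


def drop_to_layer(frame_indices, max_layer: int):
    """Keep only frames whose temporal layer is <= max_layer."""
    return [i for i in frame_indices if temporal_layer(i) <= max_layer]


frames = list(range(8))               # 8 frames captured at 30 fps
base_only = drop_to_layer(frames, 0)  # what a constrained receiver keeps
print(base_only)                      # [0, 2, 4, 6]

# Every kept frame still has its reference available, so T0 alone stays decodable.
assert all(reference_of(i) in (None, *base_only) for i in base_only)
```

Because no T0 frame ever references a T1 frame, dropping T1 halves the frame rate without breaking any dependency.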

Spatial encoding is a bit different: you will have frames at the highest resolution, but they reference another frame at half the resolution (which may also do the same). Each higher layer just adds more detail on top of the smaller frame that you have at the base. To decode an image at a given resolution, you need to have all the frames below it available. This can also be combined with the temporal encoding above. While this isn't useful for 1:1 communication, in conference rooms it's a great optimization: while you may send your full HD picture to the server, you may not want to send that to everyone when you're just a thumbnail who is not actively speaking. So the conference server will not forward the full HD picture, only the lower resolution. And since you don't want to do the encoding on the server (it's expensive, slow and you need to trust the intermediate service with your secret stuff), doing spatial encoding on the client side is better.
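
As a rough illustration of what the conference server does with spatial layers, here is a Python sketch. The `LayerFrame` type, the `layers_to_forward` function, the resolutions and the byte sizes are all hypothetical; a real server works on RTP packets and codec-specific layer signalling, which is not modelled here. The point is only that the server picks a subset of what the client already encoded, without re-encoding anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerFrame:
    spatial_id: int   # 0 = lowest resolution; each higher ID adds detail
    height: int       # illustrative resolution of this layer
    size_bytes: int   # illustrative encoded size


def layers_to_forward(frames, target_height):
    """Pick the layers a receiver needs for a given display height.

    A layer can only be decoded together with every layer below it,
    so the result is always a prefix: S0, then S1, and so on.
    """
    ordered = sorted(frames, key=lambda f: f.spatial_id)
    selected = []
    for frame in ordered:
        selected.append(frame)
        if frame.height >= target_height:
            break
    return selected


# One picture sent by the client as three spatial layers (made-up sizes).
picture = [
    LayerFrame(0, 270, 2_000),
    LayerFrame(1, 540, 5_000),
    LayerFrame(2, 1080, 14_000),
]

thumbnail = layers_to_forward(picture, target_height=270)
speaker = layers_to_forward(picture, target_height=1080)
print([f.spatial_id for f in thumbnail])  # [0]
print([f.spatial_id for f in speaker])    # [0, 1, 2]
```

A thumbnail viewer receives only the base layer, while the active speaker view gets all three, and the server never needs to see decoded pixels.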

Those techniques are all advanced ones that would be used if they were universally available. Unfortunately, a lot of hardware decoders choke on them, despite their being part of the specs. And it's not just that the hardware can't generate a stream with them; it sometimes can't decode them either (breaking the spec).

And finally, the hardware encoders are tuned for higher bitrate work. Ask them to do a 3MBps stream, they'll do fine. Ask them for a 30KBps stream, they'll make garbage most of the time.

Had you said this about audio codecs, I would have agreed. I do not know of a single smartphone video conferencing app that uses CPU encoding rather than hardware encoding. Neither WhatsApp nor FaceTime, perhaps the two largest real-time video call services, uses AV1.

Yeah, no production or large scale VC system is running software AV1 encoders on smartphones. You will drain a full phone battery in 1-2 hours of calls.

It just doesn’t make sense and will result in extraordinary power/battery drainage at best, or output that’s worse than hardware encoding.

The only way you could get AV1 to software encode in realtime AND low latency on a mid-range Android chip is by disabling or skipping nearly all of the compression/encoding features that make it good at low bitrate.

> Yeah, no production or large scale VC system is running software AV1 encoders on smartphones. You will drain a full phone battery in 1-2 hours of calls.

Yeah but, half jokingly, Zoom does that (draining the battery in an hour) already :P

So, status remains quo, the commons remain tragic, and glory to H.264 forever?

> tragic

H.264 isn't that bad at all, and may even be the best depending on how you look at it. Our Internet bandwidth, both on the backend and on the front end with mobile 5G, is increasing, with plenty more room to grow, while computation, decoding and storage aren't.

I.e., if bandwidth is infinite and free, and we are only optimising for decoding power usage, H.264 wins in a lot of these scenarios.

H264 is lacking a lot of features (behind patents) that are essential to real-time communications. It's available, but by far, the worst offender for call quality. Modern call technology will want to use temporal and spatial scaling which are not available in the profiles supported by most H264 encoders and decoders.

Those tools are available for VP8 (temporal only), VP9 and AV1 and improve the quality of calls quite a lot when used right. I don't know about the internals of H265 and H266 as those are also behind patents and no one wanted to touch them in the real-time conferencing space.

At least until a better codec has widespread enough hardware support, I think.

Google Meet can do it. You don't want the full conference with AV1, just use it for very low bitrate scenarios, possibly with high packet loss. Phones are a good target system. And I know this is quite opposite to expectations.

It is a lot better to send a stable and visually OK stream with AV1 at 30KBps than to fail to send a 50KBps VP8 stream that is unusable anyway and is subject to twice as many lost packets as a lower bitrate solution.

It is possible they use AV1 in other scenarios now, but I left the team a while back and I haven't checked what they are using under the hood these days.