by adgjlsfhk1 21 days ago
Based on AV1's trajectory, hardware encode isn't necessary (though it is nice). The current encoder is a reference encoder. Now that the spec is finalized, expect significant speed improvements from production encoders (though realtime likely won't happen until we get it in hardware).

I strongly disagree with it not being required. I run a small social news site - AV1 is still prohibitively expensive for both the server and clients when encoding/decoding in software. Without hardware encoding, trading massive battery use and very long encoding times for better compression ratios simply isn't worth it. To get AV1 out, I often have to process an h264 version of a video first anyway, just so the client isn't left waiting for their video upload to finish encoding. This means that by supporting AV1 I'm not saving anything on the storage side. Even YouTube only does AV1 encodes for extremely popular videos - it only makes sense at significant scale.

I love AV1, don't get me wrong, and I can't wait until I can switch over to it as a single unified format for both images and video, but for now the cost is too high, and it will stay that way until hardware acceleration becomes ubiquitous.

I went and checked some YouTube videos on my front page. A video with 15k views had an AV1 encode, while a video with 160 views was h264 only. So "extremely popular videos" is not how I would describe it; measured by views, almost everything you watch on YouTube is probably AV1. But they skip the extra encodes for videos that almost no one watches.
Last time I checked they do it for new videos only. Older videos with 1M views aren't even on AV1.
Makes sense, new videos are where most of the streams are happening. I wouldn't be surprised if they start to reduce the number of transcodes to save space as a video drops in popularity. h264 will work on everything, so they need that as a minimum, with AV1 just being there to save on data transfer.
Thank you. I have been saying this since the launch of AV1 on HN, Doom9 and other places. I wanted to mention that even Google uses a custom dedicated hardware ASIC for AV1 encode.

I wish LCEVC were more widespread. For the same H.264 encoding time, you get a 50% to 60% bitrate reduction using it with H.264.

Hardware encode is required if you want things like video calls, camera recording and such to use it.

It isn’t required for content distribution platforms, which aren’t realtime and where the cost of encode is easily made up by hundreds of thousands of streams.

One of the interesting uses of AV1 was specifically for low bitrate calls, and software encoding was perfectly fine, even on mobile.

With a low enough resolution, framerate and bitrate, you can get a quality stream without significant encoding artifacts, unlike with any other codec. It is in production right now and has been for a while.

The CPU/bandwidth tradeoff is quite advantageous in situations like this. And no, AV1 HW encoders usually cannot be used: they are not designed for tight bitrate control or realtime communications the way software encoders usually are.

> One of the interesting usage of AV1 was specifically for low bitrate calls, and software encoding was perfectly fine, even on mobile.

You really want hardware decoding on mobile, otherwise you end up with 40 minutes of battery life. Fortunately, for typical videoconference resolutions, VP8 and H.264 are just fine. AV1 is nice to have, though, due to excellent support for synthetic content (screen sharing), and for scalable video coding (a much more elegant solution than simulcast, IMHO).

In the world I live in, the general plan is to stick to VP8 and H.264 for the time being, and to skip to AV1 when it's universally available on mobile. I haven't seen any features of AV2 which would justify waiting for it.

No, you do NOT want hardware anything on mobile if you are targeting smaller bitrates that are not that taxing on the CPU, when the conditions are otherwise so bad that the call would either drop or be unusable. HW encoders produce bad results at low bitrate. HW decoders usually have issues with the temporal encodings used, and they may also just not accept those streams (a lot of test scenarios are movies, and the RTC tools are poorly supported).

I worked on shipping it to Chromium, WebRTC and Google Meet many years ago and we had many publications about it:
- https://blog.google/products-and-platforms/products/duo/4-ne...
- https://webrtchacks.com/the-hidden-av1-gift-in-google-meet/

The use case is not screensharing or a large conference room, but mainly a simple talking face in a 1:1 chat, with good quality, since packet loss is not as impactful on a 30KBps AV1 stream as on a 50KBps VP8 stream.

> we had many publications about it

I'd be interested in learning more, but the links you provide are just advertising copy. Could you please provide links to actual technical articles on your conclusions?

The internals are usually confidential and it's hard to find an engineer willing to make a comprehensive write-up about those: they want to make tech and not spend time proofing a tech write-up for public consumption (they already had to make an internal one!).

So the middle ground is that you have those "marketing" copies that demo the tech. One of the telling parts of those is how you can get a perfectly usable 30KBps stream with AV1, compared to a higher bitrate H264 stream that is unusable. What it doesn't tell you is that because you are using far fewer bytes, you are trading CPU power consumption for radio power consumption. It's a tricky comparison, but in general it's a favorable trade for the user who has very bad network conditions and is trying to make a call. The goal is to make the call work at all costs, not to save the battery while transferring a useless stream of data.

> HW encoders produce bad results at low bitrate.

Is that poor implementation or is it inherently harder to implement in hw encoders?

There are a few reasons; I suspect fixed resource depth might be a factor in poor hardware single pass encoding ...

What does limit them, though, is pseudo real time single pass pipelines.

I see the best encoding results from two pass - one fast run to work out the easy-to-compress and hard-to-compress parts of a video, and then a second pass to get the optimal results on a stream that already has a budget in mind for each section, thanks to the foresight of knowing what's left to do.
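
To make the two pass idea concrete, here is a toy sketch in Python. It assumes the first pass produces one complexity score per section and that the second pass splits the bit budget in proportion to those scores. Real rate control is far more elaborate; the function names and the proportional rule are made up for illustration.

```python
from typing import List


def uniform_budget(n_sections: int, total_bits: float) -> List[float]:
    """Single pass stand-in: no foresight, so every section gets the same share."""
    if n_sections <= 0:
        return []
    return [total_bits / n_sections] * n_sections


def two_pass_budget(complexity: List[float], total_bits: float) -> List[float]:
    """Second pass stand-in: split the budget in proportion to the
    per-section complexity measured by a (hypothetical) fast first pass."""
    if not complexity:
        return []
    total = sum(complexity)
    if total <= 0:
        # Nothing to distinguish the sections, fall back to an even split.
        return uniform_budget(len(complexity), total_bits)
    return [total_bits * c / total for c in complexity]


# An easy section (score 1.0) followed by a hard one (score 3.0).
print(uniform_budget(2, 400))            # even split
print(two_pass_budget([1.0, 3.0], 400))  # the hard section gets more bits
```

The point is only that the second function needs the whole list of scores up front, which is exactly what a real time single pass pipeline doesn't have.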

As someone else said, it's poor single pass encoding performance targeted for the tools used in real-time communications. This type of usage is "new" to hardware manufacturers and they poorly test it as it's easier to make a chip good enough for decoding the general case for watching your favorite movie platform than do something comprehensive.

One aspect of real-time encoding is that the frames are not ordered or structured in the same linear way as they used to be in older formats. Now we have temporal and spatial encoding, which allows for more graceful frame drops, better efficiency, or a stream that is decodable at multiple resolutions at the same time.

An example of temporal encoding is that you have a sequence of frames at 15fps (T0) that each reference the previous one, and sometimes an I-frame that is a full independent picture you can start decoding from. Then, you can have another temporal layer (T1), where for every frame at the base 15 fps layer (T0), you insert a new frame that depends on it. You end up having a 30 fps stream! And if your network connection is worse, or your hardware can't keep up, you can drop the T1 layer and only use the T0 layer. This works great for real-time! In the specs, you could have more layers with more complex dependency chains, but 3 layers is as high as you want to go.
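
A rough sketch of that T0/T1 structure in Python, to show why dropping T1 is safe while dropping a T0 frame is not. This is only a model of the dependency chain described above, not any encoder's API; the `Frame` class and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    index: int          # position in the 30 fps sequence
    layer: int          # 0 = T0 (15 fps base), 1 = T1 (enhancement)
    ref: Optional[int]  # index of the frame it depends on, None for an I-frame


def build_two_layer_stream(n_frames: int) -> List[Frame]:
    frames = []
    for i in range(n_frames):
        if i == 0:
            frames.append(Frame(i, 0, None))   # I-frame, decodable on its own
        elif i % 2 == 0:
            frames.append(Frame(i, 0, i - 2))  # T0 references the previous T0
        else:
            frames.append(Frame(i, 1, i - 1))  # T1 references the T0 before it
    return frames


def drop_t1(frames: List[Frame]) -> List[Frame]:
    """Keep only the base layer, halving the frame rate."""
    return [f for f in frames if f.layer == 0]


def is_decodable(frames: List[Frame]) -> bool:
    """True if every frame's reference arrived earlier in the list."""
    seen = set()
    for f in frames:
        if f.ref is not None and f.ref not in seen:
            return False
        seen.add(f.index)
    return True


stream = build_two_layer_stream(8)
base_only = drop_t1(stream)
print(len(stream), len(base_only), is_decodable(base_only))
```

Nothing in T0 ever references T1, so the base layer stays decodable on its own. Losing a T0 frame is a different story, since everything after it depends on it until the next I-frame.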

Spatial encoding is a bit different: you will have frames at the highest resolution, but they reference another frame at half the resolution (which may also do the same). Each higher layer just adds more detail over the smaller frame that you have at the base. To decode an image at a given resolution, you need to have all the frames it depends on available. This can also be combined with the temporal encoding above. While this isn't useful for 1-1 communication, in conference rooms it's a great optimization: while you may send your full HD picture to the server, you may not want to send that to everyone when you're just a thumbnail who is not actively speaking. So the conference server will not send the full HD picture, but the lower resolution only. And since you don't want to do the encoding on the server (it's expensive, slow and you need to trust the intermediate service with your secret stuff), doing spatial encoding on the client side is better.
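
Here is a small Python sketch of what the conference server does with those spatial layers: it never re-encodes, it just decides how many layers to forward to each receiver. The three-layer ladder (270p, 540p, 1080p) and the function names are assumptions picked for the example, not something a real server exposes.

```python
from typing import Dict, List

# Hypothetical ladder: S0 is the base, each layer doubles the one below it.
LAYER_HEIGHTS = [270, 540, 1080]


def layers_needed(target_height: int) -> List[int]:
    """Spatial layer ids to forward so the receiver can decode at
    target_height (or at the top layer if it asks for more).
    Higher layers depend on lower ones, so the list is always a prefix."""
    needed = []
    for layer_id, height in enumerate(LAYER_HEIGHTS):
        needed.append(layer_id)
        if height >= target_height:
            break
    return needed


def forward_plan(receivers: Dict[str, int]) -> Dict[str, List[int]]:
    """Per-receiver list of layers the server forwards, without transcoding."""
    return {name: layers_needed(height) for name, height in receivers.items()}


plan = forward_plan({"thumbnail_viewer": 270, "active_speaker_view": 1080})
print(plan)
```

The thumbnail viewer only gets the base layer, and the server only ever drops packets for the layers above it, which is also why it doesn't need to see the decoded video.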

Those techniques are all advanced ones that would be used if they were universally available. Unfortunately, a lot of hardware decoders choke on them, despite them being part of the specs. And it's not just that hardware can't generate a stream with those: it sometimes can't decode them either (breaking the spec).

And finally, the hardware encoders are tuned for higher bitrate work. Ask them to do a 3MBps stream, they'll do fine. Ask them for a 30KBps stream, they'll make garbage most of the time.

Had you said this about audio codecs, I would have agreed. I do not know of a single smartphone video conferencing app that uses CPU encoding rather than hardware encoding. Neither WhatsApp nor FaceTime, perhaps the two largest real-time video call apps, uses AV1.
Yeah, no production or large scale VC system is running software AV1 encoders on smartphones. You will drain a full phone battery in 1-2 hours of calls.

It just doesn’t make sense and will result in extraordinary power/battery drain at best, or output that’s worse than hardware encoding.

The only way you could get AV1 to software encode in realtime AND low latency on a mid-range Android chip is by disabling or skipping nearly all of the compression/encoding features that make it good at low bitrate.

> Yeah, no production or large scale VC system is running software AV1 encoders on smartphones. You will drain a full phone battery in 1-2 hours of calls.

Yeah but, half jokingly, Zoom does that (draining the battery in an hour) already :P

So, status remains quo, the commons remain tragic, and glory to H.264 forever?
> tragic

H.264 isn't bad at all, and may even be the best depending on how you look at it. Our Internet bandwidth, both on the backend and on the front end with mobile 5G, is increasing, with plenty more room to grow, while compute for decoding and storage aren't.

I.e., if bandwidth is infinite and free, and we are only optimising for decoding power usage, H.264 wins in a lot of these scenarios.

At least until a better codec has widespread enough hardware support, I think.
Google Meet can do it. You don't want the full conference on AV1, just use it for very low bitrate scenarios, possibly with high packet loss. Phones are a good target system. And I know this is quite the opposite of what you'd expect.

It is a lot better to send a stable and visually OK stream with AV1 at 30KBps than to fail to send a 50KBps VP8 stream that is unusable anyway and loses twice as many packets as a lower bitrate solution.

It is possible they use AV1 in other scenarios now, but I left the team a while back and I haven't checked what they are using under the hood these days.

Anything running on a battery will need hardware acceleration.
Even without encoding, as long as decoding is supported for AV2, streaming sites like YouTube can always transcode uploads. The encoder on mobile hardware is more of a nice bonus as long as we have an AV1 encoder available in the meantime.
YouTube is doing this now. Most semi-popular videos have an AV1 transcode. Something interesting is that I've seen YouTube choose the AV1 format even on my MacBook, which doesn't have a hardware decoder. I had a look at the CPU usage and there is a 50% load on one thread on my M1, but aside from the extra battery usage, this is basically negligible since I'm likely not doing any other CPU heavy tasks while watching video.