by illiac786 18 days ago
> HW encoders produce bad results at low bitrate.

Is that poor implementation or is it inherently harder to implement in hw encoders?

2 comments

There are a few reasons. I suspect fixed resource depth might be a factor in poor hardware single-pass encoding ...

What does limit them, though, is pseudo-real-time, single-pass pipelines.

I see the best encoding results from two-pass encoding: one fast run to work out which parts of a video are easy to compress and which are hard, then a second pass that gets the optimal result because each section already has a bit budget in mind, with the advantage of foresight as to what's left to do.
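To make the two-pass idea concrete, here is a minimal sketch of the budgeting step. It assumes the first pass has already produced one complexity score per section. The function name and the plain proportional split are my own illustration, not how any particular encoder's rate control works (real ones are far more elaborate).

```python
def allocate_bits(complexities, total_bits):
    """Split a total bit budget across sections, proportional to complexity.

    `complexities` is the hypothetical output of a fast first pass:
    one non-negative score per section, higher meaning harder to compress.
    """
    if not complexities:
        return []
    total = sum(complexities)
    if total == 0:
        # Nothing stands out as hard, so share the budget evenly.
        return [total_bits / len(complexities)] * len(complexities)
    return [total_bits * c / total for c in complexities]


# An easy section followed by a section three times as hard.
budgets = allocate_bits([1, 3], 400)
```

A single-pass encoder has to guess these budgets as it goes, without knowing what comes later in the stream.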

As someone else said, the problem is poor single-pass encoding performance, which is what the tools used in real-time communications depend on. This type of usage is "new" to hardware manufacturers and they test it poorly, as it's easier to make a chip that is good enough at decoding the general case, for watching your favorite movie platform, than to do something comprehensive.

One aspect of real-time encoding is that the frames are not ordered or structured in the same linear way as they were in older formats. Now we have temporal and spatial encoding, which allows for cleaner frame drops, better efficiency, or a stream that is decodable at multiple resolutions at the same time.

An example of temporal encoding is that you have a sequence of frames at 15 fps (T0) that each reference the previous one, with an occasional I-frame, which is a full independent picture you can start decoding from. Then you can have another temporal layer (T1), where for every frame in the base 15 fps layer (T0), you insert a new frame that depends on it. You end up with a 30 fps stream! And if your network connection gets worse, or your hardware can't keep up, you can drop the T1 layer and only use the T0 layer. This works great for real-time! In the specs, you could have more layers with more complex dependency chains, but 3 layers is as high as you want to go.
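As a rough illustration of the layering, the sketch below assigns each frame index to a temporal layer and then lists which frames survive when the upper layers are dropped. The function names are made up for this example, and the dyadic pattern used for 3 layers (0, 2, 1, 2, ...) is an assumption about a common layout, not something the specs force on you.

```python
def temporal_layer(frame_index, num_layers=2):
    """Return the temporal layer (0 = base, T0) of a frame.

    Assumes a dyadic hierarchy: with 2 layers the pattern is 0, 1, 0, 1, ...
    and with 3 layers it is 0, 2, 1, 2, 0, 2, 1, 2, ...
    """
    period = 1 << (num_layers - 1)
    pos = frame_index % period
    if pos == 0:
        return 0
    # Number of trailing zero bits in pos decides how deep the frame sits.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return num_layers - 1 - trailing_zeros


def decodable_frames(frame_count, num_layers, max_layer):
    """Frame indices a receiver keeps when it drops layers above max_layer."""
    return [i for i in range(frame_count)
            if temporal_layer(i, num_layers) <= max_layer]


# A 30 fps stream with two layers: dropping T1 leaves every other frame (15 fps).
base_only = decodable_frames(8, num_layers=2, max_layer=0)
```

Since nothing in T0 references T1, throwing away the T1 frames never breaks decoding of what remains.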

Spatial encoding is a bit different: you have frames at the highest resolution, but they reference another frame at half the resolution (which may also do the same). Each higher layer just adds more detail on top of the smaller frame at the base. To decode an image at a given resolution, you need all the frames it depends on. This can also be combined with the temporal encoding above. While this isn't useful for 1-to-1 communication, in conference rooms it's a great optimization: you may send your full HD picture to the server, but you don't want that sent to everyone when you're just a thumbnail who is not actively speaking. So the conference server will not forward the full HD picture, only the lower resolution. And since you don't want to do the encoding on the server (it's expensive, slow, and you need to trust the intermediate service with your secret stuff), doing spatial encoding on the client side is better.
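The server side of this can be sketched in a few lines. The `Packet` type and `forward` function are hypothetical names for illustration, and this assumes each packet is already tagged with its spatial layer. The point is that the server only filters and never decodes or re-encodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    frame: int
    spatial_layer: int  # 0 = lowest resolution, higher = more detail


def forward(packets, max_spatial_layer):
    """Keep only the layers a given receiver needs.

    A higher layer depends on every layer below it, so the server keeps
    everything up to and including the receiver's target layer.
    """
    return [p for p in packets if p.spatial_layer <= max_spatial_layer]


# One frame sent by the client as three spatial layers.
sent = [Packet(0, 0), Packet(0, 1), Packet(0, 2)]
thumbnail_view = forward(sent, max_spatial_layer=0)  # not the active speaker
full_view = forward(sent, max_spatial_layer=2)       # active speaker
```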

Those techniques are all advanced ones that would be used if they were universally available. Unfortunately, a lot of hardware codecs choke on them, despite their being part of the specs. And it's not just that they can't generate a stream with them; they also sometimes can't decode one (breaking the spec).

And finally, the hardware encoders are tuned for higher bitrate work. Ask them to do a 3MBps stream, they'll do fine. Ask them for a 30KBps stream, they'll make garbage most of the time.