Decoding AVIF: Deep dive with cats and imgproxy (evilmartians.com)
56 points by progapandist 1773 days ago
5 comments

I wish AVIF had a simpler web-oriented profile. It has inherited bloat from HEIF, which is built on a tower of specs, which were designed to be a catch-all of all features for everyone, including cameras, photo management, and editing. Because it recycles a video codec and old MP4 specs, there's also a ton of compatibility with legacy video tools — in a brand new image format.

AVIF/HEIF has hundreds of features that decoders are theoretically supposed to implement. Many ISOBMFF "boxes" also have multiple versions (typically 16/32/64-bit versions or an "oops we forgot to add a field" version). Does every decoder really need to support all of them? It's such a waste of effort, and it bloats files. Browsers only care about getting pixels on screen. They don't have UIs or APIs to browse through photo bursts, bracketed exposures, or an infrared channel (or is every AVIF viewer supposed to implement these now?)
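To make the "multiple versions" point concrete, here is a minimal sketch of walking ISOBMFF box headers. It only covers the generic layout (a 32-bit size that can be swapped for a 64-bit "largesize", plus the version/flags word of a "full box"); the function names and the hand-built sample bytes are my own illustration, not taken from any real decoder or file.

```python
import struct


def iter_boxes(data, offset=0, end=None):
    """Yield (type, payload_start, payload_end) for sibling boxes in data."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # size == 1 means a 64-bit "largesize" follows the type field
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # size == 0 means the box runs to the end of its container
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError("malformed box at offset %d" % offset)
        yield box_type.decode("ascii", "replace"), offset + header, offset + size
        offset += size


def full_box_version(data, payload_start):
    """Return (version, flags) from the first 4 payload bytes of a full box."""
    (word,) = struct.unpack_from(">I", data, payload_start)
    return word >> 24, word & 0xFFFFFF


# Hand-built toy data (hypothetical, not a valid AVIF file):
# a 32-bit-size box, a 64-bit-size box, and a full box with version 0.
data = (
    struct.pack(">I4s4sI", 16, b"ftyp", b"avif", 0)
    + struct.pack(">I4sQ", 1, b"mdat", 20) + b"abcd"
    + struct.pack(">I4sI", 12, b"meta", 0)
)

for name, start, stop in iter_boxes(data):
    print(name, stop - start)
```

Even this toy reader needs three size encodings before it has looked at a single box's contents, and each full box can then change its field widths depending on the version byte.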

And some features are even dangerous: lossless cropping. Great for your camera roll app, but for publishing on the Web it's a huge footgun. You could think you've cropped private info out of the picture you've shared, but randos on the internet can uncrop your HEIF pictures.

I had a really good go at reading this and trying to understand it but I feel I don't understand a whole lot more about image decoding/encoding than before I started. Like I get the core concepts of key frames, motion vectors and such on a high level but if you asked me to actually create a decoder I wouldn't have a clue where to start.

I feel like I would need a full hour long video on each paragraph of this post to really understand it.

We made a mpeg-ish encoder and decoder at uni, and it was surprisingly simple.

Keyframes were just jpegs. Then for intraframes, you first found the motion vectors for each 8x8 block, then generated the predicted intraframe from the keyframe and motion vectors. Then you subtracted the prediction from the actual frame. The resulting "prediction error image" was then simply jpeg encoded as the intraframe, and appended to the output after the motion vectors.

Decoding was reverse, reconstruct the predicted intraframe from the previous keyframe and motion vectors, and add back the prediction error image.
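A rough sketch of the scheme described above, assuming grayscale frames stored as lists of rows whose dimensions are multiples of the block size. The JPEG step is left out, so the prediction error is kept losslessly here. The exhaustive search, the function names and the `radius` parameter are my own choices for illustration, not the original project's code.

```python
def block(frame, y, x, n):
    """Return the n x n block whose top-left corner is (y, x)."""
    return [row[x:x + n] for row in frame[y:y + n]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def find_motion_vector(key, cur, y, x, n, radius):
    """Exhaustive search for the keyframe offset that best matches cur's block."""
    h, w = len(key), len(key[0])
    target = block(cur, y, x, n)
    best, best_cost = (0, 0), sad(block(key, y, x, n), target)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sy, sx = y + dy, x + dx
            if 0 <= sy <= h - n and 0 <= sx <= w - n:
                cost = sad(block(key, sy, sx, n), target)
                if cost < best_cost:
                    best, best_cost = (dy, dx), cost
    return best


def predict(key, vectors, n):
    """Build the predicted frame by copying displaced blocks out of the keyframe."""
    out = [[0] * len(key[0]) for _ in key]
    for (y, x), (dy, dx) in vectors.items():
        for i in range(n):
            for j in range(n):
                out[y + i][x + j] = key[y + dy + i][x + dx + j]
    return out


def encode(key, cur, n=8, radius=4):
    """Return (motion vectors, prediction error image) for cur relative to key."""
    h, w = len(key), len(key[0])
    vectors = {(y, x): find_motion_vector(key, cur, y, x, n, radius)
               for y in range(0, h, n) for x in range(0, w, n)}
    pred = predict(key, vectors, n)
    residual = [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(cur, pred)]
    return vectors, residual


def decode(key, vectors, residual, n=8):
    """Rebuild the frame: prediction from the keyframe plus the error image."""
    pred = predict(key, vectors, n)
    return [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(pred, residual)]
```

Because the error image is stored exactly in this sketch, `decode(key, *encode(key, cur))` returns `cur` unchanged. In the real thing the error image goes through a lossy JPEG round trip, so the reconstruction is only approximate.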

Might be glossing over something as it was over a decade ago but should be about the gist of it.

We played with different algorithms for finding motion vectors and such, including accelerating it with GPGPU.

Really fun project, and assuming you use an existing jpeg library, not at all big or difficult.

> simply jpeg encoded as the intraframe

In case it wasn't obvious, the error image can of course have negative values which jpeg can't handle. So you add a bias of +128 and clamp the biased error to [0, 255].

During decoding you simply subtract the bias when adding back the error image.
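A small sketch of that bias trick, assuming 8-bit samples and an error image held as lists of signed integers. The function names are made up for illustration.

```python
def clamp8(v):
    """Clamp a value to the 8-bit range [0, 255]."""
    return min(255, max(0, v))


def bias_residual(residual):
    """Shift signed errors by +128 and clamp so they fit in 8-bit samples."""
    return [[clamp8(r + 128) for r in row] for row in residual]


def reconstruct(pred, stored):
    """Add the un-biased error back onto the prediction."""
    return [[clamp8(p + s - 128) for p, s in zip(rp, rs)]
            for rp, rs in zip(pred, stored)]


stored = bias_residual([[-200, -128, -1, 0, 127, 200]])
print(stored)
```

Note the cost: errors outside [-128, 127] are clipped, so a badly predicted block cannot be fully corrected by the error image alone.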

Last year, I did a deep dive into Fabrice Bellard's obfuscated image decoder (http://www.ioccc.org/2018/bellard/hint.html) entry for the 2018 IOCCC. His code implements a lot of these decoding techniques in just 4KB of source, including a stringified 128x128 test image.

You can find my deobfuscation and detailed explanation of his program here: http://eastfarthing.com/blog/2020-09-14-decoder/

I wonder what are modern use-cases for 4:2:2 sampling? Is it simply a historic relic, which gets ported from codec to codec?

Unlike 4:2:0 it works well as a packed format (e.g. Y0 Cb Y1 Cr), and unlike 4:4:4 it's a simple 2 bytes per pixel (packed 4:4:4 either uses an uneven 3 bytes per pixel or wastes 1 byte per pixel...)

Which led to a lot of simple and professional HW/SW being designed for packed 4:2:2, so codecs support 4:2:2 to fit into professional pipelines.
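For illustration, a minimal sketch of reading one row in the packed Y0 Cb Y1 Cr order mentioned above, assuming 8-bit samples. The function name is hypothetical.

```python
def unpack_yuyv(buf, width):
    """Split one packed 4:2:2 row (Y0 Cb Y1 Cr ...) into (Y, Cb, Cr) pixels."""
    if width % 2 or len(buf) != width * 2:
        raise ValueError("expected an even width and 2 bytes per pixel")
    pixels = []
    for i in range(0, len(buf), 4):
        y0, cb, y1, cr = buf[i:i + 4]
        # Both pixels in the pair share one Cb and one Cr sample.
        pixels.append((y0, cb, cr))
        pixels.append((y1, cb, cr))
    return pixels


row = bytes([16, 128, 235, 128, 81, 90, 81, 240])
print(unpack_yuyv(row, 4))
```

Every 4-byte group is a self-contained pair of pixels, which is what makes the format easy to stream through simple hardware.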

Thank you! Yeah, the ability to handle pixel data in chunks of 4 bytes can be useful in certain contexts.

> I wonder what are modern use-cases for 4:2:2 sampling? Is it simply a historic relic, which gets ported from codec to codec?

https://en.wikipedia.org/wiki/Chroma_subsampling nicely explains this clever hack's continuing relevance.

TLDR: Chroma subsampling (as done in the 4:2:2 Y'CbCr color space) is used to improve video encoding efficiency by taking advantage of the "human visual system's lower acuity for color differences than for luminance".
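As a sketch of what the subsampling step can look like for 4:2:0, assuming a chroma plane with even dimensions stored as lists of rows. Averaging each 2x2 group and upsampling by pixel repetition are just the simplest choices, not what any particular codec mandates.

```python
def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging 2x2 groups."""
    h, w = len(plane), len(plane[0])
    return [
        [(plane[y][x] + plane[y][x + 1]
          + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4  # rounded mean
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]


def upsample_nearest(plane):
    """Restore full resolution by repeating each chroma sample 2x2 times."""
    out = []
    for row in plane:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


cb = [[10, 20, 30, 40],
      [10, 20, 30, 40]]
small = subsample_420(cb)
print(small, upsample_nearest(small))
```

The luma plane is left at full resolution; only the two chroma planes go through this.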

I do understand [1] why 4:2:0 sampling is used, my question was strictly about 4:2:2.

[1]: https://news.ycombinator.com/item?id=28204865

This is a very informative article.

BTW this cat should now and forever be called Lena.

"Commonly, three numbers are used to specify downsampling:

    the first is always 4, don’t ask me why"
Wow. Someone going to this much detail on explaining how a video/image codec works, and cannot bother learning what the numbers of chroma subsampling mean?

The first number represents the luminance.[0] Even if they know the first number represents luminance, the "don't ask me why" is just horrible on its own. The detail in the image is preserved through the luminance channel. Subsampling is much less perceptible to humans in the chroma, but much more noticeable in the luminance. Therefore, some very smart people learned to cheat on the data saved for chroma, but not the luminance. "don't ask me why" in detailed write ups is just bad in so many ways.

[0]https://en.wikipedia.org/wiki/Chroma_subsampling

Poynton has a pretty plausible-sounding explanation here (https://poynton.ca/PDFs/Chroma_subsampling_notation.pdf):

"The commonly used leading digit of 4 is a historical reference to a sample rate roughly four times the NTSC or PAL color subcarrier frequency; the notation originated when subcarrier-locked sampling was under discussion for component video. Upon the adoption of component video sampling at 13.5 MHz, the first digit came to specify luma sample rate relative to 3 3⁄8 MHz. HDTV was once supposed to be described as 22:11:11! Since then, the leading digit has – thank-fully – come to be relative to the sample rate in use. Until recently, the initial digit was always 4, since all chroma ratios have been powers of two – 4, 2, or 1. However, 3:1:1 subsampling has been commercialized in an HDTV production system (Sony’s HDCAM), so 3 may now appear as the leading digit. By convention, a leading digit of 2 is never used."

And here is lots of detailed history: https://tech.ebu.ch/docs/techreview/trev_304-rec601_wood.pdf , including a lot of debate in the late 70's about "three-times sub-carrier (3fsc) versus four-times sub-carrier (4fsc) sampling." The victory for team "4" is, I think, why that's the leading digit, even though they ended up compromising on not-quite-4 in the end.
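The arithmetic behind those digits can be checked in a few lines. The 13.5 MHz and 3 3/8 MHz figures come from the Poynton quote; the HD luma rate (74.25 MHz), the NTSC subcarrier (315/88 MHz) and the two line frequencies are standard values I am supplying myself, so treat them as assumptions.

```python
from fractions import Fraction

UNIT_MHZ = Fraction(27, 8)        # 3 3/8 MHz, the reference unit in the quote
REC601_LUMA_MHZ = Fraction(27, 2)  # 13.5 MHz
HD_LUMA_MHZ = Fraction(297, 4)     # 74.25 MHz (assumed, not from the thread)
NTSC_FSC_MHZ = Fraction(315, 88)   # about 3.58 MHz (assumed)


def digit(rate_mhz):
    """Express a sample rate as a multiple of 3 3/8 MHz."""
    return rate_mhz / UNIT_MHZ


print(digit(REC601_LUMA_MHZ), digit(REC601_LUMA_MHZ / 2))  # the "4" and "2" of 4:2:2
print(digit(HD_LUMA_MHZ), digit(HD_LUMA_MHZ / 2))          # the "22" and "11" of 22:11:11
print(float(REC601_LUMA_MHZ / NTSC_FSC_MHZ))               # the "not-quite-4" multiple of fsc

# 13.5 MHz is a whole number of samples per line in both line standards
# (line frequencies assumed: 4.5 MHz / 286 for 525-line, 15625 Hz for 625-line).
print(Fraction(13_500_000) / Fraction(4_500_000, 286), Fraction(13_500_000, 15_625))
```

So 13.5 MHz lands on exactly 4 units of 3 3/8 MHz while being somewhat less than four times the NTSC subcarrier, which fits the "compromising on not-quite-4" reading above.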

Edit: While the reason for 2 vertical lines (interlacing) is correct, it seems that the 4 horizontal lines comes from a compromise between the 525-line (NTSC) and 625-line systems' frequencies (explained at length here: https://news.ycombinator.com/item?id=28203942, thank you keithwinstein), standardised as Rec. 601. One of the proposals was 3:2:x, but it was both worse analogue-speaking and harder digitally-speaking. 3:1:x was used for HDVS, which was the uncompressed storage format of Sony. It was essentially inherited for the MUSE, D-MAC and unreleased prototype American Analog HD systems (note that ATSC was essentially a reboot, trying to outcompete MUSE and DVB).

Okay, the real reason, as far as the bundles of paper I have* are accurate, is that digital chroma subsampling was first invented for MUSE, a Japanese analogue HD video standard (with pre-broadcast digital components). They chose four for horizontal because it's relatively easy to manipulate using their digital systems at the time and two for vertical so that it's easy to handle interlacing stuff. Unfortunately, I'm not Sony or NHK so I can't say for certain why not eight or any other power of two. Also, Americans (aka the SMPTE) set the 1,080 lines (the Japanese standard is 1,025), the 16:9 compromise (between the European and Japanese 15:9 and cinema 21:9) and the "limited RGB" dilemma that is experienced in digital video systems (that's literally from the days of NTSC signalling!). Both the Japanese NHK/Sony MUSE system and the British IBA (adopted as European) D-MAC system use the full-range 8-bit system that is used for JPEG (pre-broadcast to analogue, of course).

Analogous to this, the reason why CD audio is 44,100 Hz is because that's the commonality between NTSC (System M, 525-line 480-visible 60-Hz) and 625-line (576-visible 50-Hz) systems. Digital audio was literally stored on U-matic systems at the time, and it was initially only 14-bit PCM rather than the 16-bit PCM of CDs.

* or rather, my employer's mini-library.
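The 44,100 Hz figure can be reproduced with the numbers usually cited for video-based PCM adaptors: 3 audio samples per video line, 245 usable lines per field at 60 fields/s, or 294 usable lines per field at 50 fields/s. Those line counts are not from this thread, so take them as an assumption; the function name is made up.

```python
def pcm_adaptor_rate(samples_per_line, usable_lines_per_field, fields_per_second):
    """Audio sample rate when PCM samples are stored in video lines."""
    return samples_per_line * usable_lines_per_field * fields_per_second


print(pcm_adaptor_rate(3, 245, 60))  # 525-line, 60 Hz recorder
print(pcm_adaptor_rate(3, 294, 50))  # 625-line, 50 Hz recorder
```

Both line standards give the same rate, which is the "commonality" referred to above.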

I don't think this is quite right. My understanding is that 4:2:2 chroma subsampling for Rec. 601 video (and the choice of sampling at 4x the chroma sub-carrier) came around in the days of 525/59.94 vs. 625/50 systems, before even the Japanese analog HDTV systems -- see my link above (https://tech.ebu.ch/docs/techreview/trev_304-rec601_wood.pdf).

Thank you for pointing this out - I've checked engineering notes from that time and it was indeed mentioned (it was even mentioned in proto-D-MAC from 1981), I'll correct the post. I wonder though how I missed that, or worse, forgot Rec. 601!

I don't think you're being generous with the author's statement, especially since this is in the section within which he's describing chroma subsampling. The author is stating "We use 4 as a convention. Why is that the convention? No one really knows". That seems accurate to me. Do you have a clearer answer? Your Wikipedia link doesn't provide any enlightenment AFAICT, although maybe I missed the explanation?

Just before this section the author discusses how the image is broken down into blocks. This section is where the definition of the 4s could have come from, but they left out, for brevity's sake I'm assuming, how those blocks are shaped.

"Now, let's break it down the differences between 4:4:4; 4:2:2 and 4:2:0:

The number of pixels that share color is determined by what type of chroma subsampling it is. Each sample is defined by a block of 8 pixels. The first number refers to the size of the sample and its pattern, which is typically 4 pixels wide. The second number refers to how many pixels in the top row will receive color or chroma sampling. The third number shows how many pixels on the bottom row will receive chroma samples"[0]

The block sizes and sub-sampling methods are also why there are warnings issued when trying to scale an image when the dimensions are not divisible by the block sizes. If you try to scale to an odd number, then the sampling within the blocks is broken. If you scale to a number not divisible evenly by the largest block sizes requested, then you also get issues.

[0] https://blog.westpennwire.com/what-is-chroma-subsampling
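Read that way, the J:a:b notation turns into sample counts with a few lines of arithmetic. This is a sketch assuming the usual 2-row reference block and that `b` is either 0 or equal to `a`; the function and key names are mine.

```python
def chroma_layout(j, a, b):
    """Sample counts for a J-wide, 2-row reference block in J:a:b notation."""
    luma = 2 * j                  # every pixel keeps its own luma sample
    chroma_per_plane = a + b      # a samples on the top row, b on the bottom
    return {
        "luma": luma,
        "chroma_per_plane": chroma_per_plane,
        "horizontal_factor": j // a,
        "vertical_factor": 2 if b == 0 else 1,  # b == 0: rows share chroma
        "samples_per_pixel": (luma + 2 * chroma_per_plane) / luma,
    }


for scheme in [(4, 4, 4), (4, 2, 2), (4, 2, 0), (4, 1, 1)]:
    print(scheme, chroma_layout(*scheme))
```

With 8-bit samples, `samples_per_pixel` is also the average number of bytes per pixel: 3 for 4:4:4, 2 for 4:2:2 and 1.5 for 4:2:0.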

Just after he says: "4:2:0 is the most popular case. Four luma samples per one chroma" so he does understand and write what it means. That does not explain why the value is 4.

I assume it is because with 4 you have 3 different subsampling ratios (if you want to keep factors of two, which you typically want to keep algorithms simple)

Not sure that explains why the first number has to be 4, which was their point.

Then they did not look/research very hard. See my response above. I provide a link to someone else's blog that was easily found with a DDG search.

You have not explained why the first number is always 4. (In fact, it's not always 4, it just usually is.)

One thing I never understood is why _downsampling_ is the most efficient way to compress the data about chroma into fewer bits while maximizing perceptual accuracy. It really seems like for any given target bitrate for the chroma data, there should always be a more efficient compression scheme available than simply throwing out 3/4 of the pixels and running compression algorithms on the rest. Surely modern compression can do better with a continuous low pass filter or an adaptive compression scheme that focuses data on interesting edges or something? Maybe someone here can better explain the intuition for this. I'm similarly curious about resolution in general (i.e. why does 480p upsampled ever look better than 1080p at the same bitrate) but chroma seems like a good place to start.

>Surely modern compression can do better

I "surely" look forward to your Show HN write up on your new compression algorithm. We've been iteratively getting better at compression for some time now. It seems like every time it looks like we've wrung every bit out of DCT, someone comes up with something a little more clever. Wavelets looked promising, but never took off.

>why does 480p upsampled ever look better than 1080p at the same bitrate

That's a very vague question. Are you stating that you think 480p upsampled to 1080p at 1.5Mbps looks better than a source at 1080p at 1.5Mbps? I have a hard time believing this to be true.

Why the chroma is sub-sampled and not the luminance has to do with how the cones/rods in the eyes work. There are a lot of things you can get away with (or tricks you can play on the brain, if you will) in what it is seeing. Is it better to lose half the height or half the width? Is it better to lose more red than green or blue?

JPEG XL doesn't perform chroma subsampling in its native color space of XYB. https://cloudinary.com/blog/how_jpeg_xl_compares_to_other_im...

It makes sense not only from a biological point of view, as noted in the article, but also from a technological one. Almost all color cameras use Color Filter Arrays [1], meaning that for WxH resolution you don't get WxHx3 pixel values as you would expect from the RGB images you usually consume, but only WxH (i.e. 2/3rds of the RGB image data is generated, not measured). With 4:4:4 sampling you have 12 values per 2x2 block, even though only 4 values have been measured by the camera for it. Meanwhile, with 4:2:0 sub-sampling you have 6 values, which is still more than 4, but quite convenient for processing in Y-based color spaces.

[1]: https://en.wikipedia.org/wiki/Color_filter_array

I have had the same thought. Why not do away with chroma subsampling and just compress the chroma planes more heavily than the luma plane? Does heavy compression perform worse than just throwing away 3/4 of the data?