Hi everyone, I've been working on an implementation of a model for learnable image compression, together with general support for neural image compression in PyTorch. You can try it out directly and compress your own images in Google Colab [1] or check out the source on GitHub [2].
This project is based on the paper "High-Fidelity Generative Image Compression" by Mentzer et al. [3] - this was one of the most interesting papers I've read this year! The model is capable of compressing images of arbitrary size and resolution to bitrates competitive with state-of-the-art compression methods while maintaining a very high perceptual quality. At a high level, the model jointly trains an autoencoding architecture together with a GAN-like component to encourage faithful reconstructions, combined with a hierarchical probability model to perform the entropy coding.
What's interesting is that the model avoids the compression artifacts associated with standard image codecs by subsampling high-frequency detail while preserving the global features of the image very well - for example, the model learns to sacrifice faithful reconstruction of faces and writing and spend these 'bits' in other places to keep the overall bitrate low.
The overall model is around 700MB, so transmitting the model alongside each image wouldn't be feasible. The idea is that both the sender and receiver already have access to the model, and only transmit the compressed messages between themselves.
If you have any questions or notice something weird I'd be more than happy to address them.
Would this work for a lossless / near lossless approach by having a final pass storing a delta between the compressed image and the original pixels, or do you think they diverge too much on a purely pixel-for-pixel basis for this to be valuable?
The model uses a GAN which does not learn the exact PDF. So not lossless, but as you can see from the images it gets extremely visually accurate results.
From the README
> The generator is trained to achieve realistic and not exact reconstruction. It may synthesize certain portions of a given image to remove artifacts associated with lossy compression. Therefore, in theory images which are compressed and decoded may be arbitrarily different from the input. This precludes usage for sensitive applications. An important caveat from the authors is reproduced here:
> "Therefore, we emphasize that our method is not suitable for sensitive image contents, such as, e.g., storing medical images, or important documents."
As an example of this going wrong previously, Xerox once implemented compression based on deduplicating repeated parts of documents. Obviously numbers contain tons of duplicate symbols (digits). The problem was that the scanner software treated different digits as duplicates of each other, leading to wrong numbers.
>The model uses a GAN which does not learn the exact PDF. So not lossless, but as you can see from the images it gets extremely visually accurate results.
Yes, I understand this is a lossy compression method - what I was proposing is to have the compressor as a final pass take the predicted output image, and subtract it from the original pixels. This gives you a delta between the predicted image and the original image. You can then compress that delta losslessly, and store it alongside the output of this model - if the predicted image is close enough to the original image then you've significantly reduced the amount of entropy in the delta, making it highly compressible.
This is how some domain-specific lossless compression algorithms work, e.g. DTS-HD Master Audio
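To make the proposal concrete, here is a minimal sketch of the "lossy prediction plus lossless delta" idea. It is not part of the project: it works on raw byte buffers standing in for pixel data, uses `zlib` as a stand-in lossless coder, and the function names `encode_residual` / `decode_residual` are made up for illustration.

```python
import zlib

def encode_residual(original: bytes, reconstruction: bytes) -> bytes:
    """Losslessly compress the per-byte difference (mod 256) of two equal-length buffers."""
    if len(original) != len(reconstruction):
        raise ValueError("buffers must have the same length")
    delta = bytes((o - r) % 256 for o, r in zip(original, reconstruction))
    return zlib.compress(delta, 9)

def decode_residual(reconstruction: bytes, payload: bytes) -> bytes:
    """Recover the original buffer exactly from the lossy reconstruction and the delta."""
    delta = zlib.decompress(payload)
    return bytes((r + d) % 256 for r, d in zip(reconstruction, delta))

# Toy "image": a synthetic byte pattern, and a reconstruction that is off by one
# at every fifth position (a stand-in for a good but inexact lossy decode).
original = bytes((i * 7) % 256 for i in range(4096))
reconstruction = bytes(
    (b + (1 if i % 5 == 0 else 0)) % 256 for i, b in enumerate(original)
)

payload = encode_residual(original, reconstruction)
restored = decode_residual(reconstruction, payload)
```

The closer the reconstruction is to the original, the more the delta is dominated by zeros and the smaller the payload gets. Whether a GAN-based decoder stays close enough pixel-for-pixel for this to pay off is exactly the open question.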
Yes, the model is not lossless as this would require learning the PDF in the original input space.
However, the model does learn a conditional probability distribution over a lower-dimensional representation of the original image - this is unavoidable as entropy coding requires a distribution over discrete symbols. The GAN is almost auxiliary and not a central component of the model - in fact, you can get very good results without the GAN, but it does seem to result in visually superior reconstructions.
I suspect if lossless reconstruction was your goal, you would want a different architecture. You would want the model to give you a conditional probability distribution for each pixel, conditioned on all previous pixels, so you could use a regular entropy coder to encode exact data.
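As a rough illustration of that idea (not a real image model), the sketch below uses a tiny adaptive count-based predictor conditioned only on the previous symbol and reports the ideal code length, i.e. the sum of -log2 p, that an entropy coder driven by those probabilities would approach. The function name and the order-1 context are assumptions made for the example; a learned model would condition on many more previous pixels.

```python
import math
from collections import defaultdict

def ideal_code_length_bits(symbols, alphabet_size):
    """Sum of -log2 p(symbol | previous symbol) under an adaptive count model."""
    # One count table per context, initialised to 1 (Laplace smoothing) so that
    # no symbol ever has probability zero.
    counts = defaultdict(lambda: [1] * alphabet_size)
    total_bits = 0.0
    prev = 0  # fixed starting context, known to both encoder and decoder
    for s in symbols:
        table = counts[prev]
        p = table[s] / sum(table)
        total_bits += -math.log2(p)
        table[s] += 1  # update only after coding, so a decoder can mirror it
        prev = s
    return total_bits

# A perfectly alternating binary "scanline" is very predictable from its
# previous pixel, so it should cost far less than 1 bit per symbol.
scanline = [0, 1] * 200
bits = ideal_code_length_bits(scanline, alphabet_size=2)
```

Because the model only ever looks at symbols that were already coded, a decoder can rebuild the same probabilities step by step, which is what makes the scheme lossless.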
As u/londons_explore mentioned, in theory you can train a model for lossless reconstruction - there are several papers about this, e.g. [1] is a good recent example. Lossless compressors need to learn a probability distribution over each input pixel, which amounts to maximum likelihood estimation in the original image space.
The model in the demo is a lossy compression method because it first projects the input to a lower dimensional space and performs quantization of this representation to integer values so the result can be ultimately entropy coded. It uses the mean-scale hyperprior model introduced in [1] to estimate the necessary probability distributions in the lower-dimensional space for entropy coding.
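A small sketch of the quantize-then-code step described above, assuming each latent gets a Gaussian with its own mean and scale, and that the probability of an integer symbol is the Gaussian mass on a unit-width bin around it. In the real model the means and scales would come from the hyperprior network; here they are hand-picked numbers, and the function names are hypothetical.

```python
import math

def gaussian_cdf(x, mean, scale):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def quantize(latents):
    """Round each latent to the nearest integer (this is where information is lost)."""
    return [round(v) for v in latents]

def estimated_bits(symbols, means, scales, floor=1e-9):
    """Ideal code length of integer symbols under per-element Gaussians."""
    bits = 0.0
    for q, mu, sigma in zip(symbols, means, scales):
        # Probability mass of the bin [q - 0.5, q + 0.5].
        p = gaussian_cdf(q + 0.5, mu, sigma) - gaussian_cdf(q - 0.5, mu, sigma)
        bits += -math.log2(max(p, floor))  # floor avoids log(0) for far outliers
    return bits

symbols = quantize([0.2, -1.7, 3.1])  # [0, -2, 3]
means = [0.0, -2.0, 3.0]
bits_confident = estimated_bits(symbols, means, scales=[0.1, 0.1, 0.1])
bits_uncertain = estimated_bits(symbols, means, scales=[5.0, 5.0, 5.0])
```

The better the predicted distribution matches the quantized latents, the fewer bits the entropy coder needs, which is why the quality of the probability model matters so much for the final file size.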
Yeah, high-frequency detail such as text or the facial features of faraway figures tends to get washed out after compression - this is probably due to a couple of reasons: 1) the training dataset contains relatively few pictures including text, 2) high-frequency detail is too expensive to encode and the model learns to forgo encoding this in favor of more 'important' features such as shapes, colors, etc.
Sorry, looks like both GDrive and Zenodo have exceeded the temporary download quotas, so the model checkpoints aren't available currently... If anyone has any solutions on how to publicly host model weights (~2 GB) please let me know!
I would recommend linking to a site where some example images can be easily compared (ideally with a viewer that offers toggling between them in-place to make it easy to see the differences), instead of directly linking to a colab that does heavy computations.
I assume most people just want to see the images, forcing them to recompute them is a waste of resources. Even just storing a version of the colab with the results present would help a lot.
How do you handle the traffic? If every reader who clicks the link and runs the colab costs you 15 cents for traffic, that's got to get expensive unless you have some sort of "free traffic" deal or someone else is paying for it?
The model was trained on a fairly large dataset (~1e6 images) of diverse high-resolution natural images (the Openimages dataset) - so there was no particular training domain, and it generalizes well to images of arbitrary size/resolution/content. There is a larger set of samples generated using the medium bitrate model which can be viewed in this Google Drive: https://drive.google.com/drive/folders/1lH1pTmekC1jL-gPi1fhE...
One interesting failure mode is that images dominated by high-frequency detail require a relatively large bitrate to store - see e.g. the last example in the Github README with the weird brickwork. Even though the model was trained to produce compressed representations with a soft constraint on the maximum bitrate, the filesize of the representation for this particular image is something like 60% above the nominal maximum.
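For anyone wanting to check this kind of overshoot on their own images, bits-per-pixel is just the compressed size in bits divided by the pixel count. The helper names, the 512x768 resolution, and the 0.30 bpp nominal rate below are illustrative choices, not values taken from the brickwork example.

```python
def bits_per_pixel(num_bytes, height, width):
    """Compressed size in bits divided by the number of pixels."""
    return 8.0 * num_bytes / (height * width)

def bytes_for_bpp(bpp, height, width):
    """File size in bytes that corresponds to a given bits-per-pixel rate."""
    return bpp * height * width / 8.0

height, width = 512, 768
nominal_bpp = 0.30
nominal_bytes = bytes_for_bpp(nominal_bpp, height, width)

# A representation 60% larger than the nominal size.
overshoot_bytes = 1.6 * nominal_bytes
overshoot_bpp = bits_per_pixel(overshoot_bytes, height, width)
```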
Not a compression expert, but my eyes have been trained to ignore color gradient issues and minor pixelation as long as the outlines of the shapes are clearly defined. While this approach does a better job of preserving color detail and avoids pixelation, it significantly distorts the shapes themselves (see the clock in the last example). It makes the images look like Google Maps 3D renders of sorts. How finely can you tune the target compression ratio? Maybe with a less aggressive target these distortions would not be as evident?
During training, you can set a target bitrate by heavily penalizing examples which exceed the target rate in the rate-distortion objective - so the model should learn to produce compressed representations at or below this bitrate. However, this constraint is only enforced on aggregate throughout the entire dataset - like many ML systems, there is no guarantee of behaviour for individual examples, either within or outside the training set. Despite this, the model appears to respect the target rate well, even on out-of-sample images.
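Roughly, the soft constraint can be sketched as a rate weight that jumps when the rate exceeds the target. This is a simplified illustration, assuming a two-level weight applied to the batch-average rate; the lambda values and function names are made up and are not the ones used in training.

```python
def rate_weight(rate_bpp, target_bpp, lambda_above=2.0, lambda_below=0.0625):
    """Penalise the rate heavily above the target and only lightly below it."""
    return lambda_above if rate_bpp > target_bpp else lambda_below

def rd_loss(rate_bpp, distortion, target_bpp):
    """Rate-distortion objective with a target-dependent rate weight."""
    return rate_weight(rate_bpp, target_bpp) * rate_bpp + distortion

target = 0.30

# Per-image rates in one hypothetical batch: the average is under the target
# even though one image is well above it.
batch_rates = [0.10, 0.20, 0.50]
mean_rate = sum(batch_rates) / len(batch_rates)
any_over = any(r > target for r in batch_rates)

loss_under = rd_loss(mean_rate, distortion=0.05, target_bpp=target)
loss_over = rd_loss(0.45, distortion=0.05, target_bpp=target)
```

The toy batch shows why individual images can still overshoot: the penalty only sees the average, so one expensive image can hide behind several cheap ones.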
One shortcoming is that the current model is non-adaptive, which means that the target rate is fixed. So to achieve different target compression rates you would have to train multiple models in different rate regimes. In the Colab demo there is the option to select between 3 different models trained with target bits-per-pixel (bpp) rates of 0.14bpp, 0.30bpp, and 0.45bpp, respectively - higher rates correspond to higher-fidelity reconstructions, at the expense of a lower compression ratio. The default is the `HiFIC-med` model (and this is what all the samples in the README were generated with), but the model trained at the highest bitrate should have less obvious imperfections.
There's also an aspect to the distortion that can be attributed to the entropy coding process rather than the model itself - currently the system clips values outside a certain probability range, resulting in artificial distortion - a fix is in the pipeline though.
There are lots of random spots on the image, and the brightness level changes totally.
Sure, 5232 kB to 124 kB is impressive, but people would probably prefer a badly compressed JPEG over this, since at least JPEG artifacts are predictable (and if the image isn't displayed at 100%, the artifacts would be less obvious, unlike the brightness change and spots in this result).
Edit: I just saw the result in https://hific.github.io/ for the same picture, but that one has none of these flaws (no brightness change, no weird spots here and there) with even smaller filesize. Why?
Hey, thanks for bringing the brightness issue to my attention - turns out I wasn't normalizing the output correctly - I just pushed a fix and the output images don't have the brightness change now.
As for the random spots, that's an artifact of the entropy coding algorithm. In principle this step is lossless, but there is some distortion because I'm using a custom vectorized version of an rANS encoder, and it's hard to encode overflow values in a vectorized fashion. I'm working on this though. If you can live with really slow decoding times (2-3 minutes) then you can disable vectorization to eliminate these small imperfections entirely.
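A toy illustration of why this shows up as isolated spots, assuming the coder only supports symbols in a fixed range and clips everything else. The range of [-16, 16] and the sample values are invented for the example and are not the repo's actual settings.

```python
def clip_to_range(symbols, lo, hi):
    """Clamp each symbol into [lo, hi]; values outside the range are altered."""
    return [min(max(s, lo), hi) for s in symbols]

# Quantized latents: mostly small values, with two rare outliers.
quantized = [0, -3, 2, 41, -1, -57, 4]
clipped = clip_to_range(quantized, lo=-16, hi=16)

# Only the outliers are changed, so the error is zero almost everywhere and
# large at a few isolated positions.
errors = [q - c for q, c in zip(quantized, clipped)]
```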
As for the comparison to the official model, that's mainly because of compute constraints vs. Google (this is just my weekend project). My model uses a smaller architecture and was trained for only 4e5 steps versus the 2e6 steps they reported in the paper - even then it took 4+ days on AWS! The model is also trained on the Openimages dataset, which is presumably much smaller and noisier than the massive internal dataset Google used.
It's a research demo and you seem to frame it in a negative light almost exclusively. Not that your remarks aren't valid (they are). Seems a bit flippant.
I would be interested in a comparison with a neural JPEG decoder that tries to restore the original image as much as possible.
It's incredibly hard to change the default file format on the web, but there's an opportunity to switch libjpeg to a decoder with much more realistic output images.
What was the error? I tried to make the demo notebook as robust as possible - you should be able to execute all cells in sequence once then execute cells out of sequence etc. without trouble, but it's hard to legislate for errors in Jupyter-like notebooks sometimes.
The models aren't downloading correctly. The content of the '*.pt' files says 'Google Drive - Quota exceeded'. I guess too many people have tried downloading the files from your drive.
One solution is to download (and upload to Colab) the models manually in /content/checkpoint/
GDrive doesn't download the model checkpoints correctly sometimes, leading to the following error:
```
# Setup model
I get an error in the function call 'prepare_model'
UnpicklingError: invalid load key, '<'.
```
Try rerunning the download cell if you experience this - the models downloaded should be around 1.5-2GB, so if the checkpoints are 100kB in size, the download's gone wrong.
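If it helps, a quick sanity check before loading can catch this case. The sketch below is a hypothetical helper, not part of the notebook: it assumes a failed download is either far too small or starts with `<` (an HTML error page, which matches the `invalid load key, '<'` message). The 100 MB default threshold is an arbitrary choice well below the expected checkpoint size.

```python
import os
import tempfile

def looks_like_valid_checkpoint(path, min_bytes=100 * 1024 * 1024):
    """Return True if the file is plausibly a checkpoint and not an HTML error page."""
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) < min_bytes:
        return False
    with open(path, "rb") as f:
        first_byte = f.read(1)
    # An HTML quota/error page starts with '<'.
    return first_byte != b"<"

# Demo with small stand-in files (hence the tiny min_bytes override).
with tempfile.TemporaryDirectory() as tmp:
    bad_path = os.path.join(tmp, "bad.pt")
    with open(bad_path, "wb") as f:
        f.write(b"<!DOCTYPE html><html>Google Drive - Quota exceeded</html>")

    good_path = os.path.join(tmp, "good.pt")
    with open(good_path, "wb") as f:
        f.write(b"PK\x03\x04" + bytes(64))  # arbitrary binary stand-in

    bad_ok = looks_like_valid_checkpoint(bad_path, min_bytes=16)
    good_ok = looks_like_valid_checkpoint(good_path, min_bytes=16)
    missing_ok = looks_like_valid_checkpoint(os.path.join(tmp, "missing.pt"))
```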
Oh, thanks so much for your replies. Really appreciate your attitude of taking that seriously and trying to mitigate. Not saying that from my point of view of thinking my error is somehow important, just from my experience putting work on HN, and knowing what it feels like when people let you know something goes wrong. Well done!
---
[1]: Colab demo: https://colab.research.google.com/github/Justin-Tan/high-fid...
[2]: GitHub: https://github.com/Justin-Tan/high-fidelity-generative-compr...
[3]: Original paper: https://hific.github.io/
[4]: Sample reconstructions: https://github.com/Justin-Tan/high-fidelity-generative-compr...