by tonic_section 2111 days ago
Hi everyone, I've been working on an implementation of a model for learnable image compression together with general support for neural image compression in PyTorch. You can try it out directly and compress your own images in Google Colab [1] or check out the source on GitHub [2].

This project is based on the paper "High-Fidelity Image Compression" by Mentzer et al. [3] - this was one of the most interesting papers I've read this year! The model is capable of compressing images of arbitrary size and resolution to bitrates competitive with state-of-the-art compression methods while maintaining a very high perceptual quality. At a high level, the model jointly trains an autoencoding architecture together with a GAN-like component to encourage faithful reconstructions, combined with a hierarchical probability model to perform the entropy coding.

What's interesting is that the model avoids compression artifacts associated with standard image codecs by subsampling high-frequency detail in the image while preserving the global features of the image very well - for example, the model learns to sacrifice faithful reconstruction of faces and writing and use these 'bits' in other places to keep the overall bitrate low.

The overall model is around 700MB - so transmitting the model wouldn't be particularly feasible, and the idea is that both the sender and receiver have access to the model, and can transmit the compressed messages between themselves.

If you have any questions or notice something weird I'd be more than happy to address them.

---

[1]: Colab Demo: https://colab.research.google.com/github/Justin-Tan/high-fid...

[2]: Github: https://github.com/Justin-Tan/high-fidelity-generative-compr...

[3]: Original paper: https://hific.github.io/

[4]: Sample reconstructions: https://github.com/Justin-Tan/high-fidelity-generative-compr...

5 comments

Would this work for a lossless / near lossless approach by having a final pass storing a delta between the compressed image and the original pixels, or do you think they diverge too much on a purely pixel-for-pixel basis for this to be valuable?
The model uses a GAN which does not learn the exact PDF. So not lossless, but as you can see from the images it gets extremely visually accurate results.

From the README

> The generator is trained to achieve realistic and not exact reconstruction. It may synthesize certain portions of a given image to remove artifacts associated with lossy compression. Therefore, in theory images which are compressed and decoded may be arbitrarily different from the input. This precludes usage for sensitive applications. An important caveat from the authors is reproduced here:

> "Therefore, we emphasize that our method is not suitable for sensitive image contents, such as, e.g., storing medical images, or important documents."

As an example of this going wrong previously, Xerox once implemented compression based on deduplicating repeated parts of documents. Obviously numbers contain tons of duplicate symbols (digits). The problem was that the scanner software deduplicated different numbers with each other, leading to wrong numbers.

http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

>The model uses a GAN which does not learn the exact PDF. So not lossless, but as you can see from the images it gets extremely visually accurate results.

Yes, I understand this is a lossy compression method - what I was proposing is to have the compressor as a final pass take the predicted output image, and subtract it from the original pixels. This gives you a delta between the predicted image and the original image. You can then compress that delta losslessly, and store it alongside the output of this model - if the predicted image is close enough to the original image then you've significantly reduced the amount of entropy in the delta, making it highly compressible.

This is how some domain-specific lossless compression algorithms work, e.g. DTS-HD Master Audio
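The delta idea above can be sketched in a few lines. This is only an illustration under stated assumptions: `lossy_stand_in` is a hypothetical placeholder (plain coarse quantization) standing in for the learned codec, and `zlib` stands in for whatever lossless coder you would actually use on the residual.

```python
import zlib
import numpy as np

def lossy_stand_in(img: np.ndarray, step: int = 8) -> np.ndarray:
    """Hypothetical stand-in for the learned codec: coarse quantization."""
    coarse = (img.astype(np.int16) // step) * step + step // 2
    return coarse.clip(0, 255).astype(np.uint8)

def encode_residual(original: np.ndarray, reconstruction: np.ndarray) -> bytes:
    # The residual lies in [-255, 255], so it needs a signed 16-bit type.
    delta = original.astype(np.int16) - reconstruction.astype(np.int16)
    return zlib.compress(delta.tobytes(), 9)

def decode_lossless(reconstruction: np.ndarray, payload: bytes) -> np.ndarray:
    delta = np.frombuffer(zlib.decompress(payload), dtype=np.int16)
    delta = delta.reshape(reconstruction.shape)
    return (reconstruction.astype(np.int16) + delta).astype(np.uint8)

# Synthetic gradient image as test input (values stay below 256).
rows, cols = np.mgrid[0:64, 0:64]
original = ((rows + cols) * 2).astype(np.uint8)

reconstruction = lossy_stand_in(original)
payload = encode_residual(original, reconstruction)
restored = decode_lossless(reconstruction, payload)

print("exact round trip:", np.array_equal(restored, original))
print("residual bytes:", len(payload), "raw bytes:", original.nbytes)
```

Whether this pays off for a generative codec depends on how small the residual really is: a decoder that synthesizes plausible texture can be far from the original pixel-for-pixel even when it looks right.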

Yes, the model is not lossless as this would require learning the PDF in the original input space.

However, the model does learn a conditional probability distribution over a lower-dimensional representation of the original image - this is unavoidable as entropy coding requires a distribution over discrete symbols. The GAN is almost auxiliary and not a central component of the model - in fact, you can get very good results without the GAN, but it does seem to result in visually superior reconstructions.

I suspect if lossless reconstruction was your goal, you would want a different architecture. You would want the model to give you a conditional probability distribution for each pixel, conditioned on all previous pixels, so you could use a regular entropy coder to encode exact data.
As u/londons_explore mentioned, in theory you can train a model for lossless reconstruction - there are several papers about this, e.g. [1] is a good recent example. Lossless compressors need to learn a probability distribution over each input pixel, which amounts to maximum likelihood estimation in the original image space.

The model in the demo is a lossy compression method because it first projects the input to a lower-dimensional space and performs quantization of this representation to integer values so the result can be ultimately entropy coded. It uses the mean-scale hyperprior model introduced in [2] to estimate the necessary probability distributions in the lower-dimensional space for entropy coding.

[1]: https://arxiv.org/abs/1811.12817

[2]: https://arxiv.org/abs/1802.01436
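To make the "quantize, then entropy code under a predicted distribution" step concrete, here is a minimal sketch of how the ideal code length of quantized latents can be estimated under a Gaussian with a per-element mean and scale. It is not the project's code: in the real model the means and scales would be predicted by a network, whereas here they are hard-coded hypothetical values, and the function names are made up for illustration.

```python
import math

def gaussian_cdf(x: float, mean: float, scale: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(q: int, mean: float, scale: float) -> float:
    """Probability mass of integer q: the Gaussian integrated over [q-0.5, q+0.5]."""
    p = gaussian_cdf(q + 0.5, mean, scale) - gaussian_cdf(q - 0.5, mean, scale)
    return max(p, 1e-9)  # floor to avoid log(0) for very unlikely symbols

def estimated_bits(latents, means, scales) -> float:
    """Ideal code length in bits: sum of -log2 p(q) over the quantized latents."""
    total = 0.0
    for y, mu, sigma in zip(latents, means, scales):
        q = round(y)  # quantization to an integer is where information is lost
        total += -math.log2(symbol_probability(q, mu, sigma))
    return total

latents = [0.2, -1.7, 3.1]  # hypothetical latent values

# Means close to the actual values: symbols are cheap to code.
good = estimated_bits(latents, means=[0.0, -2.0, 3.0], scales=[1.0, 1.0, 1.0])
# Means far from the actual values: the same symbols cost many more bits.
bad = estimated_bits(latents, means=[5.0, 5.0, -5.0], scales=[1.0, 1.0, 1.0])

print(f"well-predicted: {good:.2f} bits, poorly predicted: {bad:.2f} bits")
```

The better the predicted distribution matches the quantized values, the fewer bits an entropy coder needs, which is why the probability model matters as much as the autoencoder.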

> or notice something weird

> [4]: Sample reconstructions

The text in the reconstructed image in the third row looks different, the word phonomat is quite garbled, information looks a bit funny.

Yeah, high-frequency detail such as facial features for faraway figures or text tends to get washed out after compression - this is probably due to a couple of reasons: 1) The training dataset contains relatively few pictures including text, 2) high-frequency detail is too expensive to encode and the model learns to forgo encoding this in favor of more 'important' features such as shapes, colors, etc.
Sorry, looks like both GDrive and Zenodo have exceeded the temporary download quotas, so the model checkpoints aren't available currently... If anyone has any solutions on how to publicly host model weights (~2 GB) please let me know!
I would recommend linking to a site where some example images can be easily compared (ideally with a viewer that offers toggling between them in-place to make it easy to see the differences), instead of directly linking to a colab that does heavy computations.

I assume most people just want to see the images, forcing them to recompute them is a waste of resources. Even just storing a version of the colab with the results present would help a lot.

torrent? If you create a torrent I can seed it for a while
I eventually shifted the models to S3, but thanks for the offer.
How do you handle the traffic? If every reader who clicks the link and runs the colab costs you 15 cents for traffic, that's got to get expensive unless you have some sort of "free traffic" deal or someone else is paying for it?
I think S3 permits up to 20k requests before they start billing IIRC.
I hope I'm wrong, but I believe that's for the cost of the request processing, not covering the traffic.

The free tier for traffic is "15GB of Data Transfer Out", after that it's 9 cents per GB. https://aws.amazon.com/s3/pricing/?nc1=h_ls (under "Data transfer"). Check your AWS bill!
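For a rough sense of scale, the transfer cost can be estimated as below. This is a back-of-the-envelope sketch that takes the 15 GB free allowance and 9 cents per GB from the comment above as given (they may be out of date) and assumes each run downloads the full ~2 GB of checkpoints; `transfer_cost_usd` is a made-up helper name.

```python
def transfer_cost_usd(downloads: int, size_gb: float,
                      free_gb: float = 15.0, price_per_gb: float = 0.09) -> float:
    """Estimated data-transfer-out cost; the defaults are the figures quoted in the thread."""
    total_gb = downloads * size_gb
    billable_gb = max(0.0, total_gb - free_gb)
    return billable_gb * price_per_gb

# Assumption: 100 readers each pull the full ~2 GB of model weights.
cost = transfer_cost_usd(downloads=100, size_gb=2.0)
print(f"estimated cost: ${cost:.2f}")
```

Under these assumptions the free allowance is used up after about seven full downloads, and each further one costs roughly 18 cents.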

Huh. So a DVD at 4.7 GB would go from containing roughly 940 5 MB photos to a 700 MB model + 80,000 photos.
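A quick check of the arithmetic, assuming decimal units (1 GB = 1000 MB) and taking the 5 MB photo size, the 700 MB model, and the 80,000-photo figure from the comment as given:

```python
DVD_MB = 4700              # 4.7 GB DVD, decimal units assumed
PHOTO_MB = 5               # size of one conventionally stored photo
MODEL_MB = 700             # model size quoted in the post
COMPRESSED_PHOTOS = 80_000 # photo count quoted in the comment

# How many 5 MB photos fit on the disc with no model on it.
plain_photos = DVD_MB // PHOTO_MB

# Space left per photo once the model is stored alongside 80,000 photos.
budget_kb = (DVD_MB - MODEL_MB) * 1000 / COMPRESSED_PHOTOS

# Compression ratio that per-photo budget implies relative to 5 MB.
implied_ratio = PHOTO_MB * 1000 / budget_kb

print(plain_photos, "plain photos")
print(budget_kb, "KB per compressed photo")
print(implied_ratio, "x implied compression ratio")
```

So the 80,000 figure corresponds to a budget of about 50 KB per photo, i.e. roughly a 100x reduction from 5 MB.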
Does this have issues with out of domain images?
The model was trained on a fairly large (~1e6 images) dataset of diverse high-resolution natural images (the Open Images dataset) - so there was no particular training domain, and it generalizes well to images of arbitrary size/resolution/content. There is a larger set of samples generated using the medium bitrate model which can be viewed in this Google Drive: https://drive.google.com/drive/folders/1lH1pTmekC1jL-gPi1fhE...

One interesting failure mode is that images dominated by high-frequency detail require a relatively large bitrate to store - see e.g. the last example in the GitHub README with the weird brickwork. Even though the model was trained to produce compressed representations with a soft constraint on the maximum bitrate, the file size of the representation for this particular image is something like 60% above the nominal maximum.
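For anyone who wants to check this on their own images, bitrate is usually reported in bits per pixel (bpp), and the overshoot relative to a target can be computed as follows. The image size, file size, and nominal target below are hypothetical numbers chosen only to illustrate a ~60% overshoot; they are not measurements from the model.

```python
def bits_per_pixel(file_size_bytes: int, height: int, width: int) -> float:
    """Bitrate of a compressed representation in bits per pixel."""
    return 8 * file_size_bytes / (height * width)

def overshoot(actual_bpp: float, nominal_bpp: float) -> float:
    """Fraction by which the actual bitrate exceeds the nominal target."""
    return actual_bpp / nominal_bpp - 1.0

# Hypothetical example: a 512x768 image compressed to 35,389 bytes,
# compared against an assumed nominal target of 0.45 bpp.
bpp = bits_per_pixel(file_size_bytes=35_389, height=512, width=768)
excess = overshoot(bpp, nominal_bpp=0.45)

print(f"{bpp:.3f} bpp, {excess:.0%} above the nominal target")
```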