Sounds like a straightforward time-space tradeoff: if you have the compressed layers sitting around when you need them, you can avoid the expense and time of compressing them.
I'm not sure about the fastest macbook disk access, but even with NVMe storage I've found lz4 to be faster than the disk. That is (it's hard to say this exactly correct) compressed content gets read/written FASTER than uncompressed content because fewer bytes need to transit the disk interface and the CPU is able to compress/decompress significantly faster than data is able to go through whatever disk bus you've got.
If you have many cores and have the right optimizations in place the bottleneck for lz4 decompression is RAM throughput which is always going to beat whatever fancy disk setup you have.
But yes on the extreme end absolutely there's a point where lz4 stops making sense, but also most of us aren't trying to max out a 128 core postgres server or whatever.
Recompressing should be guaranteed deterministic. It’s the packing/unpacking of tar archives to/from directories on disk that leads to the non-determinism (such as timestamps and ownership metadata). If the tar is left intact, both zstd and gzip should produce byte for byte identical outputs given the same compression parameters.
You are correct; I confused archiving with compression. However, even considering only the compression process, same compression parameters cannot be guaranteed, as it is unknown which compression parameters the image publisher used.
Thats true. And regardless of compressed vs regular tar, I think the OCI format working with opaque archives is extremely limiting. I hope the industry will eventually redesign to use content addressable storage per file and have metadata to describe the layer/disk layout instead. That would allow per file deduplication, and we can use tar for just bulk transfer over the wire, rather than using tar for the data at rest.
containerd 2.3 has support for erofs which does a direct import of the layer.
It can even convert the tar based layers to erofs, faster than extracting the tar normally.
Also looking at block-based content store so that blocks can be deduped across images.
Thanks for the correction. I did mean given the same tooling version/parameters, but (as you and others pointed out) preserving and recreating that state is not at all straightforward.
Zstd for example only promises determinism on the same version of the library. I've personally seen the hashes mutate between pull and export. Things like tar padding also make a difference. Really, the thing to do is to hash on the _uncompressed_ data and let compression be a transport/registry detail. That's what I've done, at least.
Yes, compression being part of the OCI image's digest was (in hindsight) a poor decision. _Technically_ OCI images allow uncompressed layers, and the layers could be included without compression (and transport compression to be used); this would allow layers to be fully reproducible. We explored some options to do this (and made some preparations; https://github.com/containerd/containerd/pull/8166), but also discovered that various implementations of registry clients didn't handle transport-compression correctly (https://github.com/distribution/distribution/pull/3754), which could result in client either pulling the full, uncompressed, content, or image validation failing.