| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Filligree 85 days ago
	Why would I need the compressed layers?

3 comments

XYen0n 85 days ago

The OCI manifest references the hashes of these compressed layers, and re-compressing them does not guarantee obtaining the same hash

link

flakes 85 days ago

Recompressing should be guaranteed deterministic. It’s the packing/unpacking of tar archives to/from directories on disk that leads to the non-determinism (such as timestamps and ownership metadata). If the tar is left intact, both zstd and gzip should produce byte for byte identical outputs given the same compression parameters.

link

XYen0n 85 days ago

You are correct; I confused archiving with compression. However, even considering only the compression process, same compression parameters cannot be guaranteed, as it is unknown which compression parameters the image publisher used.

link

flakes 85 days ago

Thats true. And regardless of compressed vs regular tar, I think the OCI format working with opaque archives is extremely limiting. I hope the industry will eventually redesign to use content addressable storage per file and have metadata to describe the layer/disk layout instead. That would allow per file deduplication, and we can use tar for just bulk transfer over the wire, rather than using tar for the data at rest.

link

cpuguy83 85 days ago

containerd 2.3 has support for erofs which does a direct import of the layer. It can even convert the tar based layers to erofs, faster than extracting the tar normally.

Also looking at block-based content store so that blocks can be deduped across images.

link

cpuguy83 85 days ago

That is not correct. You would have to use the same compression tool (and likely version) for this to match.

Old docker discarded the compressed bits but kept some metadata about the the so it can at least recreate the tar.

It also recreated the manifest o push.

link

flakes 85 days ago

Thanks for the correction. I did mean given the same tooling version/parameters, but (as you and others pointed out) preserving and recreating that state is not at all straightforward.

link

mort96 85 days ago

If that's the purpose, couldn't you store the hash and throw away the compressed image?

(As others said, compression is deterministic for the same algorithm, parameters and input data)

link

a_t48 85 days ago

Zstd for example only promises determinism on the same version of the library. I've personally seen the hashes mutate between pull and export. Things like tar padding also make a difference. Really, the thing to do is to hash on the _uncompressed_ data and let compression be a transport/registry detail. That's what I've done, at least.

link

mort96 85 days ago

I didn't know that about zstd, that's a bit unfortunate.

Tar isn't related here though, we're talking about compression not archival formats

link

thaJeztah 85 days ago

Yes, compression being part of the OCI image's digest was (in hindsight) a poor decision. _Technically_ OCI images allow uncompressed layers, and the layers could be included without compression (and transport compression to be used); this would allow layers to be fully reproducible. We explored some options to do this (and made some preparations; https://github.com/containerd/containerd/pull/8166), but also discovered that various implementations of registry clients didn't handle transport-compression correctly (https://github.com/distribution/distribution/pull/3754), which could result in client either pulling the full, uncompressed, content, or image validation failing.

link

a_t48 85 days ago

For my registry fork/custom pull client I hash on the uncompressed content and store as compressed under the uncompressed digest. This lets me have my cake and eat it, too - compression free digests, smaller storage costs, be able to set consistent compression settings, have the ability to spend extra CPU to recompress on the backend without breaking hashes, etc. I control both pull client and registry, so it works.

link

cpuguy83 85 days ago

The whole entire reason is compression is not deterministic across tooling.

Pushing

What about pushing? Computers are fast enough to compress stuff as it's being transmitted, you don't need to store the compressed copy anywhere...

link

cryptonym 85 days ago

To save disk space /s

link