It does!

The major distinction (besides this being generalized to pluggable binary container formats) is that Courgette operates in the context of delta comparison between two known binaries. The goal there is to create a minimal delta patch that can move you from one known binary to another known binary (given a sufficiently smart updater, which itself "contains" a lot of the information, in a Kolmogorov-complexity sense).

There are efficiencies you can gain in encoding that only work in a one-to-one, known-to-known comparison context. (Think inter-frames, i.e. P- and B-frames, in video; or RLE'd 1-bit delta-sigma audio encoding.) In these contexts, you need known1 on hand already, plus a patch specific to the pair {known1, known2}, to be able to recover known2. (This is why, in streaming video where non-delta-compressed "keyframes" are rare, you can't seek to arbitrary locations in the stream without the decoded video turning into a garbled mess for a while: you're receiving patches, but you don't have the [correct, clean] known1s to apply them to.)
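To make the known-to-known dependency concrete, here's a minimal sketch of a copy/insert delta built with Python's `difflib`. It is not how Courgette or any video codec works; `make_patch` and `apply_patch` are names I made up for illustration. The point is only that the patch is mostly "copy these ranges out of known1", so it's useless against any other base.

```python
from difflib import SequenceMatcher

def make_patch(old: bytes, new: bytes) -> list:
    """Encode `new` as copy/insert operations against `old` (hypothetical format)."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Only a reference into `old` is stored, not the bytes themselves.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Literal bytes that exist only in `new`.
            ops.append(("insert", new[j1:j2]))
        # "delete" needs no operation: those bytes are simply never copied.
    return ops

def apply_patch(base: bytes, ops: list) -> bytes:
    """Rebuild the target by replaying the operations against `base`."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

known1 = b"hello world, version one"
known2 = b"hello world, version two!"
patch = make_patch(known1, known2)

recovered = apply_patch(known1, patch)                        # right base
garbled = apply_patch(b"HELLO world, version one", patch)     # wrong base
```

With the right base you get known2 back; with any other base the copy ranges pull in the wrong bytes, which is the "garbled mess until the next keyframe" effect in miniature.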

But these efficiencies don't apply in the context of a system like Git, where you might be checking out any arbitrary version of the data at any time, with any [or no!] other arbitrary version(s) already cached or checked out. To enable people to efficiently grab the delta they need for the particular, arbitrary version-to-version jump they want to make, you'd need to generate either the "power-set" of all updates (strictly speaking, one patch for every ordered pair of versions), or at least enough entries to populate a skip-list (as Microsoft apparently does for Windows updates! [1]).

The powerset of updates (one patch per version pair, so O(N²) patches) is impractical to generate and store, but it gets clients the ability to perform arbitrary jumps in O(1) time. With a skip-list of updates, on the other hand, either the server or the client needs to compute its way through all the skips to combine them into one final update, so each jump would take O(log N) "patch application" computation steps to resolve. (This is why it takes the Microsoft servers a while to figure out which Windows updates your computer needs!) Neither approach really makes sense for a system like Git, where repos are entirely mirrored to every client (so the clients would need to store all those update patches), and where checkouts are something you're supposed to be able to do often and quickly, in the middle of your daily workflow.
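A rough sketch of that trade-off, under a simplified model I'm assuming purely for illustration: versions are numbered 0..N-1, jumps only go forward, and the "skip-list" holds a patch from every version v to v + 2^k. Real update systems are laid out differently; `plan_hops` and `all_pairs_patch_count` are hypothetical names.

```python
def all_pairs_patch_count(n_versions: int) -> int:
    """Patches needed so that any version can jump to any other in one step."""
    return n_versions * (n_versions - 1)  # one patch per ordered pair

def plan_hops(src: int, dst: int) -> list:
    """List the (from, to) patches to apply, assuming a patch exists from
    every version v to v + 2**k (an assumed layout, forward jumps only)."""
    if dst < src:
        raise ValueError("this sketch only models forward jumps")
    hops = []
    cur = src
    while cur < dst:
        # Take the largest power-of-two skip that does not overshoot dst.
        step = 1 << ((dst - cur).bit_length() - 1)
        hops.append((cur, cur + step))
        cur += step
    return hops

hops = plan_hops(3, 16)  # distance 13 = 8 + 4 + 1, so three patches
```

Each hop is a separate patch application, and the number of hops is at most the number of bits in the distance, hence the O(log N) cost per jump versus one lookup (but quadratically many stored patches) for the all-pairs scheme.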

Meanwhile, the sort of approach I outlined would be optimized instead for storing currently known binaries, in a way most amenable to inter-compression against future, unknown versions of the binary, without the need to store any explicit known-to-known patches, or to compute the composition of such patches.

Rather than seeking to compress together existing available things as well as possible, a known-to-unknown compression approach seeks to segment and normalize the data in such a way that future revisions of the same data, put through the same process, would generate lots of content-identical duplicate component objects, which would end up being dedup'ed away when stored in a Content-Addressable Store.
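Here's a minimal sketch of the "segment, then dedup in a CAS" idea. The segmentation rule (split on newlines) is just a stand-in I picked so the example stays tiny; a real format-aware segmenter would be whatever the pluggable container handler provides. `ContentAddressableStore`, `segment`, `store_revision`, and `load_revision` are all hypothetical names.

```python
import hashlib

class ContentAddressableStore:
    """Toy in-memory CAS: objects are keyed by the SHA-256 of their content."""

    def __init__(self):
        self.objects = {}

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.objects.setdefault(key, chunk)
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]

def segment(data: bytes) -> list:
    """Stand-in segmenter (an assumption): one chunk per line, newline kept."""
    return data.splitlines(keepends=True)

def store_revision(store: ContentAddressableStore, data: bytes) -> list:
    """Store every chunk and return the ordered list of keys (the manifest)."""
    return [store.put(chunk) for chunk in segment(data)]

def load_revision(store: ContentAddressableStore, manifest: list) -> bytes:
    return b"".join(store.get(key) for key in manifest)

store = ContentAddressableStore()
rev1 = b"alpha\nbeta\ngamma\n"
rev2 = b"alpha\nNEW\nbeta\ngamma\n"   # a later revision, unknown when rev1 was stored

manifest1 = store_revision(store, rev1)
manifest2 = store_revision(store, rev2)
```

Nothing here ever compares rev1 against rev2. The second revision only costs one new object because it was pushed through the same segmentation and three of its four chunks already exist in the store.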

It's the difference between "meeting in the middle" and canonicalization. Levenshtein distance vs. document fingerprinting. RAID5 vs. a fountain-code-encoded dataset. Or, to be very literal, it's streaming-window Huffman coding vs. Merkle-trie encoding.

This "normalize, to later deduplicate" known-to-unknown compression approach is the approach that Git itself is built on, but it only works well when the files Git works with are decomposed into mostly-independent leaf-node chunks. My proposed thought here is just that it wouldn't be impossible to translate other types of files into "mostly-independent leaf-node chunks" at commit time, such that Git's approach would then be applicable to them.
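As a sketch of what that commit-time translation could look like, take a zip archive as the container format. Git does not do this today; `decompose_zip` and `make_zip` are hypothetical helpers, and zip is only an example format. The one part that is Git's actual behavior is the blob ID: in a SHA-1 repository it is the SHA-1 of `blob <size>\0` followed by the content.

```python
import hashlib
import io
import zipfile

def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def decompose_zip(container: bytes) -> dict:
    """Hypothetical commit-time step: map each archive member to the blob ID
    of its decompressed content, instead of hashing the archive as one file."""
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        return {name: git_blob_id(zf.read(name)) for name in zf.namelist()}

def make_zip(members: dict) -> bytes:
    """Build an in-memory zip so the example needs no files on disk."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()

v1 = make_zip({"a.txt": b"unchanged member", "b.txt": b"old contents"})
v2 = make_zip({"a.txt": b"unchanged member", "b.txt": b"new contents"})

tree1 = decompose_zip(v1)
tree2 = decompose_zip(v2)
```

Stored as opaque files, v1 and v2 are two unrelated blobs. Decomposed, `a.txt` gets the same ID in both revisions and would be stored once, and only `b.txt` contributes a new object.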

[1] https://devblogs.microsoft.com/oldnewthing/20200212-00/?p=10...