Git does compress repositories, but the fact that versioning repositories with (huge) data is still an open problem suggests that its compression is not the kind that solves it. I might be mistaken; are you aware of any interesting compression methods applied to version control?
That doesn’t mean those Deflate trees are optimal, as you can see with tools like OxiPNG, which re-optimize the Deflate compression to reduce PNG file sizes by about half.
In theory, the same optimization could be applied to Git blobs; it would be cool if there were a tool that did that.
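To make that concrete, here is a minimal Python sketch of why re-optimizing would be safe in principle. It assumes the usual loose-object layout (a `blob <size>\0` header followed by the content, zlib-compressed on disk) and that the object ID is the SHA-1 of the uncompressed bytes. The helper names and the sample data are made up for illustration, and zlib's level 9 only stands in for a smarter Deflate optimizer.

```python
import hashlib
import zlib


def git_blob_object(content: bytes) -> bytes:
    """Return the uncompressed Git object: header plus content."""
    return b"blob " + str(len(content)).encode() + b"\0" + content


def blob_id(content: bytes) -> str:
    """SHA-1 object ID, computed over the uncompressed object."""
    return hashlib.sha1(git_blob_object(content)).hexdigest()


def deflate(content: bytes, level: int) -> bytes:
    """Zlib-compress the object, as a loose object is stored on disk."""
    return zlib.compress(git_blob_object(content), level)


# Hypothetical sample data: repetitive, CSV-like text.
content = b"".join(b"row %d,value %d\n" % (i, i * i % 97) for i in range(20000))

fast = deflate(content, 1)  # cheap compression
best = deflate(content, 9)  # more effort, same Deflate format

# Both streams decode to identical bytes, so the object ID is unaffected.
same_payload = zlib.decompress(fast) == zlib.decompress(best)
print(len(fast), len(best), same_payload, blob_id(content))
```

Since the ID never depends on the compressed bytes, a recompression tool would not have to rewrite history; it would only have to emit valid zlib streams.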
——
My second point was more about what would happen if Git were upgraded to support a new compression algorithm such as zstd in place of Deflate, much like the switch of the hash algorithm from SHA-1 to SHA-256 (though if I were in charge, I’d go with BLAKE3 because it’s far faster).
Fully agree. In many cases, however, compression removes the ability to diff easily. Take a large dataset where 1% of the original data (by size) changes, or where new data amounting to 1% of the original size is added: for storage, I think compression does not compare with simply deduplicating the unchanged 99%, but when speed is the #1 factor the discussion is more nuanced. It might be interesting to combine deduplication with better compression of the changes, in some form, to get the best tradeoff. ML repositories are large these days, and I'm curious which repository compression techniques are being evaluated and deployed.
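A rough Python sketch of that comparison, under loud assumptions: the data is synthetic, the chunk size is arbitrary, the 1% change is an in-place overwrite, and chunks are fixed-size (an insertion would shift every later boundary, which is why real systems tend to use content-defined chunking). The names `store` and `synthetic` are made up for illustration.

```python
import hashlib
import random
import zlib

CHUNK = 4096  # fixed chunk size, chosen arbitrarily for this sketch


def store(data: bytes, pool: set) -> list:
    """Record chunk hashes in the pool; return the chunks not seen before."""
    fresh = []
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        key = hashlib.sha256(piece).digest()
        if key not in pool:
            pool.add(key)
            fresh.append(piece)
    return fresh


def synthetic(rng: random.Random, n_bytes: int) -> bytes:
    """Random data over a 4-letter alphabet, so it is somewhat compressible."""
    return bytes(rng.choice(b"ACGT") for _ in range(n_bytes))


rng = random.Random(0)
v1 = synthetic(rng, 200 * CHUNK)

v2 = bytearray(v1)
for index in (50, 150):  # overwrite 2 of 200 chunks, i.e. 1% of the data
    v2[index * CHUNK:(index + 1) * CHUNK] = synthetic(rng, CHUNK)
v2 = bytes(v2)

pool = set()
store(v1, pool)          # version 1 fills the pool
fresh = store(v2, pool)  # version 2 only contributes its changed chunks

full_compressed = len(zlib.compress(v2, 9))                         # compress all of v2
dedup_only = sum(len(p) for p in fresh)                             # store changed chunks raw
dedup_then_compress = sum(len(zlib.compress(p, 9)) for p in fresh)  # both combined

print(full_compressed, dedup_only, dedup_then_compress)
```

On this toy input, deduplication alone stores only the two changed chunks (8192 bytes), far less than a compressed full copy of the second version, and compressing just those chunks shrinks the increment further. That is the combination described above; how it plays out on real ML data is not something this sketch can tell you.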