HN Mirror



	by KingLancelot 1139 days ago
	Honestly, it’d be nice if there were something like PNGCrush for git repos. It would also be cool if Git offered zstd compression.

1 comments

eliomattia 1139 days ago

Git does compress repos, but the fact that versioning repositories with (huge) data is still an open problem suggests that this kind of compression does not fix it. I might be mistaken, though. Are you aware of any interesting compression methods applied to version control?


KingLancelot 1139 days ago

I know git uses deflate, hence my first point.

That doesn’t mean those deflate encodings are optimal, as you can see with tools like OxiPNG, which re-optimize the deflate compression to reduce PNG file sizes, in some cases by about half.

In theory, the same optimization could be applied to git blobs. It would be cool if there were a tool that did that.

——

My second point was more about upgrading git to support a newer compression algorithm such as zstd in place of deflate, much like the hash algorithm is being switched from SHA-1 to SHA-256 (though if I were in charge, I’d go with BLAKE3 because it’s far faster).
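To make the "PNGCrush for git blobs" idea concrete, here is a minimal sketch, not an existing tool. It relies on two facts about git's loose object format: an object is stored as a zlib stream of a `blob <size>\0` header plus the content, and the object ID is the SHA-1 of that uncompressed payload. The function names (`recompress`, `stored_object_id`) and the sample data are made up for illustration, and packfiles are ignored entirely.

```python
import hashlib
import zlib


def loose_object_payload(content: bytes) -> bytes:
    """Uncompressed form of a blob: header, NUL byte, then the raw content."""
    return b"blob %d\x00" % len(content) + content


def git_blob_id(content: bytes) -> str:
    """Object ID: SHA-1 over the uncompressed payload, not the stored bytes."""
    return hashlib.sha1(loose_object_payload(content)).hexdigest()


def recompress(stored: bytes, level: int = 9) -> bytes:
    """Hypothetical 'crush' step: inflate, then deflate again with more effort."""
    return zlib.compress(zlib.decompress(stored), level)


def stored_object_id(stored: bytes) -> str:
    """Recover the object ID from the compressed bytes as they sit on disk."""
    return hashlib.sha1(zlib.decompress(stored)).hexdigest()


# Illustrative text-like content; any bytes would do.
content = "".join(f"line {i}: value={i * i % 97}\n" for i in range(5000)).encode()

fast = zlib.compress(loose_object_payload(content), 1)  # cheap compression
crushed = recompress(fast)                              # same data, more effort

print(len(fast), len(crushed))
print(stored_object_id(fast) == stored_object_id(crushed))
```

Because the ID depends only on the uncompressed bytes, the recompressed object keeps its hash, so history is untouched. A tool in the OxiPNG spirit would presumably swap `zlib.compress` for a more exhaustive deflate encoder; that part is left out here.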


eliomattia 1139 days ago

Fully agree. In many cases, though, compression makes the data harder to diff. Take a large dataset where 1% of the original data (by size) changes, or where new data amounting to 1% of the original is added: for storage, I don't think compression can compete with simply deduplicating the unchanged 99%. When speed is the #1 factor, the discussion is more nuanced. It might be interesting to combine deduplication with better compression of the changes, in some form, to get the best tradeoff. Repo sizes in ML are large these days, and I'm curious which repository compression techniques are being evaluated and deployed.
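The 1% scenario above can be sketched with a toy content-addressed chunk store that deduplicates first and compresses only the chunks it has not seen. Everything here is an assumption made for illustration: the `ChunkStore` class, the fixed 4 KiB chunk size, and random bytes standing in for data that is already hard to compress. It does not show how git or any particular tool stores data.

```python
import hashlib
import os
import zlib


class ChunkStore:
    """Toy store: each unique chunk is kept once, compressed, keyed by its hash."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # hex digest -> compressed chunk

    def add_version(self, data):
        """Store one version; return (chunk keys, compressed bytes newly written)."""
        keys, written = [], 0
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.chunks:  # only unseen chunks cost storage
                self.chunks[key] = zlib.compress(chunk, 9)
                written += len(self.chunks[key])
            keys.append(key)
        return keys, written

    def read_version(self, keys):
        """Rebuild a version from its ordered list of chunk keys."""
        return b"".join(zlib.decompress(self.chunks[k]) for k in keys)


CHUNK = 4096
v1 = os.urandom(100 * CHUNK)  # 100 chunks of poorly compressible data
# Second version: overwrite one chunk in place, i.e. 1% of the data changes.
v2 = v1[:10 * CHUNK] + os.urandom(CHUNK) + v1[11 * CHUNK:]

store = ChunkStore(CHUNK)
keys1, written1 = store.add_version(v1)
keys2, written2 = store.add_version(v2)
whole_file = len(zlib.compress(v2, 9))  # cost of compressing v2 with no dedup

print(written2, whole_file)
```

With these assumptions the second version costs roughly one chunk, while compressing it whole costs about as much as the full dataset. Fixed-size chunks only work this well for in-place changes; an insertion shifts every later chunk boundary, which is what content-defined chunking is meant to handle.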
