by eliomattia 1139 days ago
Fully agree. In many cases, however, compression removes the ability to diff easily. Take a large dataset where 1% of the data (by size) changes, or where new data amounting to 1% of the original is added. For storage, I think compressing does not compare with simply deduplicating the unchanged 99%; when speed is the #1 factor, the discussion is more nuanced. It might be interesting to combine deduplication with better compression of the changes, in some form, to get the optimal tradeoff. Repo sizes in ML are large these days, and I'm curious which repository compression techniques are being evaluated and deployed.
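To make the "deduplicate the unchanged 99%, compress only the changes" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular tool works: `ChunkStore`, `split_chunks`, and the 4096-byte `CHUNK_SIZE` are hypothetical names and values chosen for the example. It uses fixed-size chunks, so it only deduplicates well when data changes in place; an insertion that shifts every later chunk would probably need content-defined chunking instead.

```python
import hashlib
import random
import zlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen for illustration


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept once, compressed."""

    def __init__(self) -> None:
        self.blobs = {}  # SHA-256 hex digest -> zlib-compressed chunk

    def add_version(self, data: bytes) -> list:
        """Store one version of the dataset and return its manifest of digests."""
        manifest = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blobs:
                # Only chunks not seen before cost any compression work or space.
                self.blobs[digest] = zlib.compress(chunk)
            manifest.append(digest)
        return manifest

    def read_version(self, manifest: list) -> bytes:
        """Rebuild a version from its manifest."""
        return b"".join(zlib.decompress(self.blobs[d]) for d in manifest)

    def stored_bytes(self) -> int:
        """Total compressed payload held by the store (manifests not counted)."""
        return sum(len(blob) for blob in self.blobs.values())


# Demo: 100 chunks of pseudo-random data, then a second version with 1 chunk changed.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(100 * CHUNK_SIZE))
patch = bytes(rng.getrandbits(8) for _ in range(CHUNK_SIZE))
v2 = v1[:50 * CHUNK_SIZE] + patch + v1[51 * CHUNK_SIZE:]

store = ChunkStore()
m1 = store.add_version(v1)
size_after_v1 = store.stored_bytes()
m2 = store.add_version(v2)
size_after_v2 = store.stored_bytes()

print("bytes stored after v1:", size_after_v1)
print("extra bytes for v2:   ", size_after_v2 - size_after_v1)
```

The second version adds roughly one chunk of storage, and the two manifests differ in a single entry, so a coarse diff between versions stays cheap even though the stored chunks are compressed. The speed side of the tradeoff shows up in `add_version`: every chunk is hashed, but only new chunks are compressed.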