Hacker News new | ask | show | jobs
by tln 604 days ago
> get ~20% of their content changed

...daily? monthly? how many versions do you have to keep around?

I'd look at a simple zstd dictionary based scheme, first. Put your history/metadata into a database. Put the XML data into file system/S3/BackBlaze/B2, zstd compressed against a dictionary.

Create the dictionary : zstd --train PathToTrainingSet/* -o dictionaryName Compress with the dictionary: zstd FILE -D dictionaryName Decompress with the dictionary: zstd --decompress FILE.zst -D dictionaryName

Although you say you're fine with it being not that storage efficient to a degree, I think if you were OK with storing every version of every XML file, uncompressed, you wouldn't have to ask right?

1 comments

If one stores a whole versions of the files that defeats the idea of git, and would consume too much space. I suppose I don't even need zstd if I have ZFS with compression, although compression levels won't be as good.
You're relying on compression either way... my hunch is that controlling the compression yourself may get you a better result.

Git does not store diffs, it stores every version. These get compressed into packfiles https://git-scm.com/book/en/v2/Git-Internals-Packfiles. It looks like it uses zlib.