by jerrre 1075 days ago
I like the idea, but I just tested your specific example with https://facia.dev/tools/compress-decompress/gzip-compress/

and the second version compresses better. That matches my intuition, even without knowing how gzip works: you still need some kind of identifier for repeated strings, and how much shorter than 1 byte can it be?

2 comments

I ran some tests, and with zstd, brotli, and gzip, single-byte identifiers win even on slightly more complicated examples. If you have more than N identifiers (where N is the number of valid single-byte identifiers), then using short strings from the brotli dictionary seems to be at worst break-even for brotli.

Also, there seems to be a 40-byte minimum size with zstd: making the original example significantly smaller (or larger with repeated patterns) always yielded 40-byte files. I had to change it to e.g. print(hello);foo(hello); bar(hello) to get any difference between using "x" and "hello" as an identifier.
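
For anyone who wants to repeat this kind of comparison, here is a minimal sketch for gzip only, using Python's standard library (brotli and zstd would need third-party packages, so they are left out). The helper names `gzip_size` and `rename` and the `{id}` placeholder are made up for illustration; the snippet is the print/foo/bar example above, not the OP's original example.

```python
import gzip

# Hypothetical template: "{id}" marks where the identifier goes.
TEMPLATE = "print({id});foo({id}); bar({id})"


def rename(template: str, identifier: str) -> str:
    """Substitute the chosen identifier into the template."""
    return template.replace("{id}", identifier)


def gzip_size(source: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of `source`.

    mtime=0 keeps the gzip header deterministic between runs.
    """
    return len(gzip.compress(source.encode("utf-8"), compresslevel=9, mtime=0))


long_size = gzip_size(rename(TEMPLATE, "hello"))
short_size = gzip_size(rename(TEMPLATE, "x"))

print(f'identifier "hello": {long_size} bytes')
print(f'identifier "x":     {short_size} bytes')
```

Note that every gzip file carries a fixed header and trailer (18 bytes at minimum), so on inputs this small the container overhead makes up a large share of the result. Swapping in a longer, more realistic source file would be the more interesting test.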

OP's suggestion may work better on real-life source code, which would be longer and include many more kinds of repeated strings.