Hacker News new | ask | show | jobs
by roncohen 1965 days ago
another contender is zstd: https://github.com/facebook/zstd. It typically offers better compression ratios than LZ4 at a slight (depending on your data) cost in speed. Additionally it offers a training mode to tune the algorithm to increase compression ratio on specific types of data, particularly useful for compression of small pieces of data.
5 comments

> It typically offers better compression ratios than LZ4 at a slight (depending on your data) cost in speed

Per the table at [0], zstd provides only a slight improvement in compression ratio, and in exchange is about half the speed of lz4.

They both have their place.

0. https://facebook.github.io/zstd/

That table shows zstd comparing poorly at its faster settings, but at slower settings, it offers a significantly better compression ratio, albeit 3x slower decompression.
LZ4 has branchless decompression, and lower cache footprint, thus it can work on low end, and non-desktop CPUs equally well.

zstd, brotly, snappy were seemingly all made with high end x86 capabilities in mind.

I also appreciate LZ4's simplicity and tiny code footprint.

zstd is brilliant as well, but in terms of code base it's a whole other beast.

Yes decompression on baremetal cortex m4 is a mere hundreds of bytes, you can decompress it from flash directly to its output buffer.
I've used it in bootloaders that have slow transfer mechanisms (uart, i2c) to get whatever speedup I can for a few hundred bytes of binary.
Google snappy is same class as lzo and lz4, not same class as brotli and zstd.
Also see Daniel Reiter Horn's DivANS built at Dropbox: https://dropbox.tech/infrastructure/building-better-compress...
Zstd is very different - it includes an entropy coder. LZ4 only finds repeated matches, but then doesn't encode them very efficiently.

To put it simplistically, if you have a file which is a (good) random mix of an equal number A and B characters, LZ4 won't be able to compress it significantly, while Zstd will compress it 8:1 converging to an encoding where a '1' bit is A, and a '0' bit is B.

> To put it simplistically, if you have a file which is a (good) random mix of an equal number A and B characters, LZ4 won't be able to compress it significantly

I checked it. LZ4 is still reducing the size to half, no idea why half. So for 10 MB file it compresses to 5 MB.

Edit: checked with highest compression and it compresses 1MB file to 185KB. So what the parent wrote is false.

Yes, if I take the 8 combinations aaa, aab, aba etc and assign each of them a 9 bit codeword I replace each 24 bit sequence with a 9 bit sequence. So arithmetic coders have no problem with cases like this.
but LZ4 doesn't have a arithmetic coder, or any other statistical encoding - it's just matches and literals. Puzzling...
Yep, Zstd is the spiritual successor to LZ4 and written by the same person (Yann Collet) after they got hired by Facebook.
Actually, I seem to recall that he was working on it before getting hired by Facebook (unless there was a massive delay in the hiring to become known). I was following his excellent blog posts on the matter at the time.
Yes it was a fully working things before facebook. There has been a lot of improvement in both the core and cli. But the core innovations of zstd was well established before facebook. I was probably following his blogs (even though I wasn't a compression expert) for months before I saw the post about his joining facebook.
Yann wrote LZ4 and Zstd well before joining FB. I have to applaud FB for supporting Yann's work, though.
I've spent an afternoon testing zstd's custom dictionaries. It really only provides benefits on small data blocks. According to my tests, the largest blocks at which custom dictionaries could still provide a benefit is 8K, above that the compression ratio advantage compared to the default is definitely gone.
> Additionally it offers a training mode to tune the algorithm to increase compression ratio on specific types of data

Yes, however there is usually no facility to train your compression algo with most tools using ZSTD.

There should be a way to pool standard dictionaries somewhere, such as a "standard english text corpus data" dictionary, that you can then download on demand for encoding, say, BLOB text fields in a database with little to no overhead.

The way this would probably work without this facility though, say, in a database, is that the dictionary is maintained internally and constructed on the fly from the field data and not exposed to users. Although, I don't know if you'd have to keep every version of the dictionary in order to successfully decompress old data? If so then perhaps this is a niche feature

W.r.t. standard dictionaries, it's something we're interested in, but the fundamental reality of dictionaries is that their effectiveness is strongly tied to their specificity. Put another way, a universal dictionary is a self-contradiction.

And yes, totally, I know at least RocksDB supports exactly that behavior [0].

[0] https://github.com/facebook/rocksdb/blob/12f11373554af219c51...