Hacker News new | ask | show | jobs
by meiji163 1768 days ago
I wonder if LZ would still be standard, if not for the inertia of gzip/zip? There are surely better and comparably fast algorithms (paq, ppm, etc.)
6 comments

This might not apply to all LZ based programs/algorithms but compress (an old Unix utility based on the Lempel-Ziv-Welch algorithm) fell out of favour after it ran afoul of patent issues (you can read more about it on the compress Wikipedia page [0] and a section of the GIF Wikipedia page [1] which covers more about the enforcement of the patent). From what I can gather though, it enjoyed considerable success and was pretty much the standard utility for compressing data on Unix. I think eventually though it would have been replaced because other algorithms (bzip2, gzip, etc.) have slightly better compression ratios (even if they are more computationally expensive).

[0] - https://en.wikipedia.org/wiki/Compress [1] - https://en.wikipedia.org/wiki/GIF#Unisys_and_LZW_patent_enfo...

LZ is popular because it's computationally efficient. When random memory access is slow but sequential access is fast, you want to encode substrings instead of individual symbols. Explicit higher-order probabilistic models require a lot of memory and generate too many cache misses. Implicit models based on the Burrows-Wheeler transform also have poor memory access patterns.
> if not for the inertia of gzip/zip

I think that's misidentifying where's credit is due: LZ might not be the most efficient compression format when it comes to size, but it's very efficient computation-wise when compressing and decompressing, and this still holds true today even when you consider newer algorthms like Zstandard (which uses a derivative of LZ). Sure, if you need to archive data into its compactest representation at any cost, LZ and derivatives won't deliver it, but unless there's a significant change to how computing works it's still in LZ's favour. At the time DEFLATE was designed, LZ just beat out every competition since realistically you can't run decompression at real-time using minimal memory.

ppm and paq are a LOT slower than LZ77.
PAQ and PPM are not fast, certainly not in the same category as LZ codecs.
zip/gzip are LZ algorithms, as mentioned near the top of the article. Based on LZ77 to be exact.

Did you perhaps mean LZW (gif & compress), which was based on LZ78?

zlib isn't really dominant, today. lzma seems to have overtaken it for anything destined for public distribution.

lzma is slow to decompress. Unless the bandwidth / storage costs are the primary concern, zstd has good enough compression ratio and the very fast decompression makes everything much nicer. Built-in threaded compression, long range compression, binary diff, dictionary compression, etc. are a bonus.
> zstd has good enough compression ratio

I never understood zstd. It's basically lzma+gzip+lz4 packed together. Show me any zstd level that has significantly different speed and data size than one of the levels of those three can't match.

> Show me any zstd level that has significantly different speed and data size than one of the levels of those three can't match.

In case you are not just trolling, I had some very large JSON file with a large amount of text in my SSD around and the compression time was as follows [1]:

    8,826,654,133       original
    4,763,212,322  0:29 lz4 -1
    3,815,508,500  0:52 brotli -1
    3,715,002,172  2:09 lz4 -3
    3,668,204,232  2:27 gzip -1
    3,159,659,113  1:59 brotli -2
    3,118,316,529 10:55 lzma -0 -T4
    3,025,746,073  1:21 zstd -3 -T1 (default)
Zstandard is not just lzma+gzip+lz4. It is better than everything you've mentioned at least for my test case. In fact I first compressed with zstd and tried to match the compression time for others, because zstd is very fast and using anything else as a reference point would have taken me much more time. It does have enough reason to claim itself to be "standard".

[1] Tested with i7-7700 3.60GHz and 48 GiB of RAM (but no algorithm tested use more than several MBs of memory anyway). I'm using a pretty fast SSD here so I/O speed is not much of concern. Also note that every algorithm except for lzma is single-threaded.

Sounds like you're comparing multi-threaded zstd to other utilities that are single-threaded (e.g. gzip instead of pigz).

I duped /usr/share/dict/words into an 8GB file and did a couple tests on the old system I'm on:

  8589934592  original
  3075315128  pigz -1       1m0.44s
  2926825272  zstdmt -3     2m37.52s
  2877549999  pigz -3       1m28.94s
Zstd by default is single-threaded. Just in case though...

    8,826,654,133       original
    3,033,695,892  0:22 zstd 1.3.3 -3 -T4
    3,025,746,073  1:21 zstd 1.3.3 -3 -T1 (default)
    3,017,972,162  1:05 zstd 1.5.0 -3 -T1 (default)
    3,013,860,663  1:23 zstd 1.5.0 -3 --single-thread
It is just no match.

EDIT: axiolite wanted to see --single-thread and while my Linux box only has an older zstd (1.3.3) without that option I realized I do have a Windows executable for recent zstd (1.5.0). Both executables have run in the same machine but I can't guarantee anything about the 1.5.0 binary.

Isn't that exactly the point, and why it's so great? It's a single algorithm that's tune-able across that wide range of speed/ratio trade-offs using a single parameter. So you can just use one thing for almost every application instead of picking between three different things.

(Yes, I know that one parameter controls multiple different settings internally, so there are multiple dimensions if you're willing to dig that deep.)

Anyway, my recollection from looking at benchmarks a while ago: zstd used at similar ratios to lzma compresses in similar time but decompresses much faster, and it's also faster than gzip when set to comparable ratios to that. lz4 is still faster than the fastest zstd modes, and lzma at the most extreme settings still gets better ratios that the best zstd can do. But there's a huge wide swath in the middle where zstd beats them all, and that's quite valuable.

> So you can just use one thing for almost every application instead of picking between three different things.

I honestly can't see how remembering several ranges of zstd levels and their time/speed trade-off is any easier than remembering lz4, pigz, xz, which all happen to have good default setting.

I use zstd once a year, and it is very easy: 1. the default (3 on my machine) is fast and compresses well. 2. if you need to get the best compression level for your use case, use the benchmark "zstd -b1 -e15 yourfile" and after a few minutes you have your answer.

I can't see why I should ever in my life use another program for compressing things than zstd.

It's really not lzma+gzip+lz4, except by if you mean "it's an LZ-alike". It's a modern LZ with repcodes (kinda LZMA-ish, but more like LZX honestly), and it uses a novel arith coder system (Huffman/ANS) instead of a range/arith coder (LZMA) or none (LZ4)
Here are some DB dump compression tests done years ago, look at the pareto frontier and see that zstd wins between LZ4 and LZMA (xz): https://zeevt.github.io/compress_pareto.html