by vmp 3158 days ago
The next default compressor might be lrzip [1] by Con Kolivas. I've only seen it a couple of times in the wild so far, but for certain files it can increase the compression ratio quite a bit.

[1] https://github.com/ckolivas/lrzip

  # 151M    linux-4.14-rc6.tar.gz
  # GZIP decompression
  ~$ time gzip -dk linux-4.14-rc6.tar.gz

  real    0m4.518s
  user    0m3.328s
  sys     0m13.422s

  # 787M    linux-4.14-rc6.tar
  # LRZIP compression
  ~$ time lrzip -v linux-4.14-rc6.tar
  [...]
  linux-4.14-rc6.tar - Compression Ratio: 7.718. Average Compression Speed: 13.789MB/s.
  Total time: 00:00:56.37

  real    0m56.533s
  user    5m35.484s
  sys     0m9.422s

  # 137M    linux-4.14-rc6.tar.lrz
  # LRZIP decompression
  ~$ time lrzip -dv linux-4.14-rc6.tar.lrz
  [...]
  100%     786.16 /    786.16 MB
  Average DeCompression Speed: 131.000MB/s
  Output filename is: linux-4.14-rc6.tar: [OK] - 824350720 bytes
  Total time: 00:00:06.35

  real    0m6.524s
  user    0m8.031s
  sys     0m1.766s

  # Results
  ~$ du -hs linux* | sort -h
  137M    linux-4.14-rc6.tar.lrz
  151M    linux-4.14-rc6.tar.gz
  787M    linux-4.14-rc6.tar

Tested on WSL (Ubuntu Bash on Windows 10).

edit:

  ~$ time xz -vk linux-4.14-rc6.tar
  linux-4.14-rc6.tar (1/1)
    100 %        98.9 MiB / 786.2 MiB = 0.126   3.0 MiB/s       4:25

  real    4m25.189s
  user    4m23.828s
  sys     0m1.094s
  
  ~$ du -hs linux* | sort -h
  99M     linux-4.14-rc6.tar.xz
  137M    linux-4.14-rc6.tar.lrz
  151M    linux-4.14-rc6.tar.gz
  787M    linux-4.14-rc6.tar
It looks like xz still has the best compression ratio, but it also took the longest wall-clock (real) time.

lrzip is a preprocessor that finds matches in the distant past that the backend compressor (xz) couldn't normally find. zstd has a new long-range matcher mode inspired by the ideas behind rzip/lrzip, with some extra tricks. It produces data in the standard zstd format, so it can be decompressed by the normal zstd decompressor. There is a short article about it in the latest release notes: https://github.com/facebook/zstd/releases/tag/v1.3.2
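A minimal Python sketch of why reaching into the distant past matters, using only the standard library. The stand-ins are assumptions chosen for illustration: zlib (32 KiB window) plays the short-window backend, and lzma at its default preset (8 MiB dictionary) plays a matcher that can see further back. This is not how lrzip or zstd is implemented, and `random_bytes` and `BLOCK_SIZE` are made-up names.

```python
import lzma
import random
import zlib


def random_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random bytes (hypothetical helper); effectively incompressible."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


# Illustrative size, chosen to be larger than zlib's 32 KiB window.
BLOCK_SIZE = 100_000

block = random_bytes(BLOCK_SIZE, seed=0)
data = block + block  # the second copy starts 100,000 bytes after the first

# zlib cannot look back 100,000 bytes, so it never sees the repeat.
zlib_size = len(zlib.compress(data, 9))

# lzma's default dictionary is large enough to find the earlier copy.
lzma_size = len(lzma.compress(data))

print(f"input: {len(data)}  zlib -9: {zlib_size}  lzma: {lzma_size}")
```

The short-window compressor should end up close to the full input size, while the long-window one should end up close to the size of a single block. A long-range preprocessor is meant to close that same kind of gap for repeats that fall outside the backend's own window.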
I tried a few times with zstd at various compression levels on the Linux kernel sources. While I've been impressed with zstd, and have some projects lined up to use it, it seems the Linux kernel sources are not a great fit for it. xz handily beats it, and not by a small margin either. I had to really ratchet up the compression level (20+) before I could get close to 100 MB.
In general, xz beats zstd on compression ratio, as xz is very committed to providing the strongest compression at the expense of speed, while zstd provides a range of compression ratio vs. speed tradeoffs [0]. At the lower levels, zstd doesn't approach xz's compression ratio, but it runs much, much faster. Additionally, zstd generally massively outperforms xz in decompression speed:

  $ time xz linux-4.14-rc6.tar

  real    4m26.009s
  user    4m24.828s
  sys     0m0.724s

  $ wc -c linux-4.14-rc6.tar.xz
  103705148 linux-4.14-rc6.tar.xz

  $ time zstd --ultra -20 linux-4.14-rc6.tar
  linux-4.14-rc6.tar   : 12.81%   (824350720 => 105564246 bytes, linux-4.14-rc6.tar.zst)

  real    4m34.129s
  user    4m33.484s
  sys     0m0.432s

  $ time cat linux-4.14-rc6.tar.xz | xz -d > out1

  real    0m9.677s
  user    0m6.608s
  sys     0m0.704s

  $ time cat linux-4.14-rc6.tar.zst | zstd -d > out2

  real    0m1.702s
  user    0m1.220s
  sys     0m0.520s
[0]: https://github.com/facebook/zstd/blob/dev/doc/images/DCspeed...
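To put the two runs above side by side, here is a small Python sketch that works out the ratios from the byte counts and `real` times in the transcript. The numbers are copied from that output; the dictionary layout and the `summarize` helper are hypothetical, and a single timed run per tool is only a rough indication.

```python
# Byte counts and wall-clock ("real") times copied from the transcript above.
ORIGINAL_BYTES = 824_350_720

runs = {
    "xz": {"bytes": 103_705_148, "decompress_s": 9.677},
    "zstd --ultra -20": {"bytes": 105_564_246, "decompress_s": 1.702},
}


def summarize(name: str, run: dict) -> dict:
    """Derive ratio, output size in percent, and decompression throughput."""
    return {
        "name": name,
        "ratio": ORIGINAL_BYTES / run["bytes"],
        "percent": 100 * run["bytes"] / ORIGINAL_BYTES,
        "decompress_mb_s": ORIGINAL_BYTES / 1e6 / run["decompress_s"],
    }


summaries = [summarize(name, run) for name, run in runs.items()]
for s in summaries:
    print(f"{s['name']:<18} ratio {s['ratio']:.3f}  "
          f"({s['percent']:.2f}% of original)  "
          f"decompress {s['decompress_mb_s']:.0f} MB/s")

# How much faster zstd decompressed than xz in this one measurement.
speedup = runs["xz"]["decompress_s"] / runs["zstd --ultra -20"]["decompress_s"]
print(f"zstd decompression speedup over xz: {speedup:.2f}x")
```

On these figures the zstd output is roughly 1.8% larger than the xz output, while decompression took about a sixth of the wall-clock time.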
While making no judgement against lrzip, I'll point out that outperforming gzip is pretty much the baseline as far as compression goes. A more interesting comparison would be against some modern compressors like zstd: http://facebook.github.io/zstd/
lrzip is a somewhat more polished implementation of the same idea as rzip by Andrew Tridgell. That is: use rsync's rolling-hash algorithm to implement LZ77-style matching with an enormous dictionary size, then compress the output of that with some general-purpose compression algorithm (bzip2 in rzip; IIRC it is configurable in lrzip).
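A toy Python sketch of the rolling-hash idea, to show the shape of it rather than rzip's actual code. It assumes a simple polynomial rolling hash as a stand-in for rsync's checksum, an unbounded in-memory table, and a made-up token format of literals and (offset, length) copies. `encode`, `decode`, `block_hash`, `BASE` and `MOD` are all hypothetical names.

```python
BASE = 257
MOD = (1 << 61) - 1  # large prime modulus for the polynomial hash


def block_hash(chunk: bytes) -> int:
    """Polynomial hash of a whole block, computed from scratch."""
    h = 0
    for byte in chunk:
        h = (h * BASE + byte) % MOD
    return h


def encode(data: bytes, block: int = 32) -> list:
    """Split data into ("lit", bytes) and ("copy", offset, length) tokens."""
    n = len(data)
    if n < block:
        return [("lit", data)] if data else []
    top = pow(BASE, block - 1, MOD)  # weight of the byte leaving the window
    seen = {}                        # window hash -> earliest start position
    tokens = []
    lit_start = 0                    # start of the pending literal run
    i = 0                            # start of the current window
    h = block_hash(data[:block])
    while True:
        cand = seen.get(h)
        # Verify the bytes, since equal hashes do not guarantee equal blocks.
        if cand is not None and data[cand:cand + block] == data[i:i + block]:
            length = block
            while i + length < n and data[cand + length] == data[i + length]:
                length += 1          # extend the match as far as it goes
            if lit_start < i:
                tokens.append(("lit", data[lit_start:i]))
            tokens.append(("copy", cand, length))
            i += length
            lit_start = i
            if i + block > n:
                break
            h = block_hash(data[i:i + block])
            continue
        seen.setdefault(h, i)
        if i + block >= n:
            break
        # Roll the window one byte forward: drop data[i], add data[i + block].
        h = ((h - data[i] * top) * BASE + data[i + block]) % MOD
        i += 1
    if lit_start < n:
        tokens.append(("lit", data[lit_start:]))
    return tokens


def decode(tokens: list) -> bytes:
    """Rebuild the input; copies go byte by byte so overlapping copies work."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out += token[1]
        else:
            _, offset, length = token
            for k in range(length):
                out.append(out[offset + k])
    return bytes(out)


sample = b"abcdXYabcd"
tokens = encode(sample, block=4)
print(tokens)
assert decode(tokens) == sample
```

In a real tool the token stream would then be handed to the backend compressor. The point of the rolling hash is that moving the window by one byte costs constant time, so every position can be checked against the table no matter how far back the earlier copy is.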