Hacker News new | ask | show | jobs
Many times faster (de)compression using multiple processors. (iasylum.net)
37 points by igorhvr 5209 days ago
7 comments

It's good people get interested in the subject. But this is very odd and has some errors. For example xz requires a lot more memory resources than bzip2 (see benchmarks below, Mem column).

http://mattmahoney.net/dc/text.html

http://mattmahoney.net/dc/uiq/

Matt Mahoney mantains the best benchmarks on text and generic compression. Some of the best on the field (like Matt) usually hang out at encode.ru.

By resources I meant wall-time, mostly - what I am optimizing for (I move >200GB zfs snapshots around a lot..). I did not pay attention to memory, specifically - in that light I updated the side note.

What else did you find odd/wrong?

You don't mention ratios. Better compression becomes asymptotically/exponentially harder when it gets close to the optimal code size.

Thanks for the shout out. BTW Matt's benchmarks are for all major compression engines, not only bzip2.

That benchmark runs xz with the -9e flags, which turn on its slowest and most memory-intensive mode. If you pass it -0 it only needs 3 MB to compress and 1 MB to decompress.

I usually use -0 with xz because it is extremely fast and memory-efficient yet still compresses better than gzip or bzip2. You can also use -0e for a slower, better compression that still requires only 1 MB to decompress. This way the decompressor can run entirely within the CPU's cache.

Yes, it's in the benchmarks. But comparing against bzip2 is not very meaningful. It's more relevant to compare with compressors in the same efficiency rate. For those flags it compresses to 26MB, in that range there are many equivalent ROLZ/LZP engines with similar numbers. For example csc32 is a bit faster but uses more memory.

http://mattmahoney.net/dc/text.html#2118

http://mattmahoney.net/dc/text.html#2300

Proper analysis would need benchmarking with different data and different flags for all compressors.

Is this decompressing a single stream on multiple processors? My knowledge of gzip is very limited, but I would have thought sequential processing was required. What's the trick here? (TFA doesn't explain anything, and e.g. pigz homepage doesn't either).
You are right. The .pdf version of the user manual of pigz (at http://zlib.net/pigz/pigz.pdf ) has a few details (the bottom of this answers your question):

"Pigz compresses using threads to make use of multiple processors and cores. The input is broken up into 128 KB chunks with each compressed in parallel. The individual check value for each chunk is also calculated in parallel. The compressed data is written in order to the output, and a combined check value is calculated from the individual check values.

The compressed data format generated is in the gzip, zlib, or single-entry zip format using the deflate compression method. The compression produces partial raw deflate treams which are concatenated by a single write thread and wrapped with the appropriate header and trailer, where the trailer contains the combined check value.

Each partial raw deflate stream is terminated by an empty stored block (using the Z_SYNC_FLUSH option of zlib), in order to end that partial bit stream at a byte boundary. That allows the partial streams to be concatenated simply as sequences of bytes. This adds a very small four to five byte overhead to the output for each input chunk. The default input block size is 128K, but can be changed with the -b option. The number of compress threads is set by default to the number of online processors, which can be changed using the -p option. Specifying -p 1 avoids the use of threads entirely. The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread. This can be turned off using the -i or --independent option, so that the blocks can be decompressed independently for partial error recovery or for random access.

Decompression can’t be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances. Parallel decompression can be turned off by specifying one process ( -dp 1 or -tp 1 )."

I also just updated the article with this info.

Interesting. How is this capable of extracting LZMA "solid" archives where all files are part of a single stream to achieve better (and very much slower) compression?
Had to try this on my quad core laptop, as I never heard of these tools .

  josh@snoopy:~/Downloads $ grep -m2 -i intel  /proc/cpuinfo 
  vendor_id       : GenuineIntel
  model name      : Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz

  josh@snoopy:~/Downloads $ ls -l test 
  -rw-r--r-- 1 josh josh 1073741824 2012-03-07 20:06 test

  josh@snoopy:~/Downloads $ time gzip test
  real    0m16.430s
  user    0m10.210s
  sys     0m0.490s

  josh@snoopy:~/Downloads $ time pigz test 
  real    0m5.028s
  user    0m16.040s
  sys     0m0.620s
Looks good.. although the man page describes it as being "an almost compatible replacement for the gzip program".
That's actually a dual core with hyperthreading. (I'm not just being pedantic, I'm curious as to how well hyperthreading works.) http://ark.intel.com/products/52224

I wonder what the speed looks like with two threads? (pigz -p 2) ?

But what's the compression ratio and what's the test data?
Test data

  josh@snoopy:~/Downloads $ dd if=/dev/zero of=test bs=1024K count=1000
  
  josh@snoopy:~/Downloads $ gzip -l test.gz 
  compressed        uncompressed  ratio   uncompressed_name
  1176024          1048576000  99.9% test.txt
with -p2 it runs ~2 sec slower
For that kind of data you can just use RLE instead. It's the trivial case (an edge case in fact).
Is xz less resource intensive then bzip2? My testing (admittedly two years ago or so) showed significant differences, better compression ratio with xz but significantly longer and/or more memory used.
In the tests I did (including compression of an entire filesystem image), xz always compresses better for the same effort (amount of time).

One example (using the standard tools, not the multithreaded ones):

  igorhvr:tmp/ $ ls -la co_2011-08_import.mdb*
  -rw-r--r-- 1 igorhvr igorhvr 261832704 2011-10-11 07:19 co_2011-08_import.mdb
  igorhvr:tmp/ $ sudo time nice -n -20 bzip2 -k co_2011-08_import.mdb
  42.85user 0.14system 0:43.16elapsed 99%CPU (0avgtext+0avgdata 31968maxresident)k 112inputs+0outputs (3major+2268minor)pagefaults 0swaps
  igorhvr:tmp/ $ sudo time nice -n -20 xz -3 -k co_2011-08_import.mdb
  30.99user 0.19system 0:31.48elapsed 99%CPU (0avgtext+0avgdata 130640maxresident)k 96inputs+0outputs (2major+8389minor)pagefaults 0swaps
  igorhvr:tmp/ $ ls -la co_2011-08_import.mdb*
  -rw-r--r-- 1 igorhvr igorhvr 261832704 2011-10-11 07:19 co_2011-08_import.mdb
  -rw-r--r-- 1 igorhvr igorhvr  24020243 2011-10-11 07:19 co_2011-08_import.mdb.bz2
  -rw-r--r-- 1 igorhvr igorhvr  23949408 2011-10-11 07:19 co_2011-08_import.mdb.xz
In this case xz compressed roughly to the same size (slighly better than bzip2, actually) using only 72% of the time. This is often the case.
Interesting I may have been comparing parallel bzip2 with single threaded xz or perhaps it was just the data I was using for a test.

I'll have to look at it again, thanks!

Put two spaces before your lines of terminal output to prevent HN from munging them.
Done - thanks
If you're handling a lot of data it make sense to hash-partition it on some key and spread it out to a large number of files.

In that case you might have, say, 512 partitions and you can farm out compression, decompression and other tasks to as many CPUs as you want, even other machines in a cluster.

I like to use PPMd (via 7zip) for large volumes of text, but it seems to only be single-threaded, which is a shame. It cuts a good 30% again off the size of the .xml.bz2's that Wikipedia provides.
This is awesome since compression in parallel has been largely neglected in practice.
It's not neglected, it's just hard and a lot of projects failed. But there are many papers on the subject in the last 10 years. Also parallel is a vague term, it can mean multicore, SIMD, GPU and clusters. All different, with a unique set of tradeoffs.