by sounds 1507 days ago
Can you test it out and post results back here?

For linux-5.17.6.tar:

Original file: 129MB xz, 1.2G uncompressed.

"zstd -T0": 1.34 seconds, 189M

"xz -T0": 63 seconds, 131M

"xz -T0 -9": 183 seconds, 125M

"bzip3 -e -j 6": 21 seconds, 129M (edited, was SIGSEGV)

"bzip3 -e": 84 seconds, 129M

I used the Linux source because the project's website uses it and recommends bzip3 for compressing source code and text. Results are from Ubuntu 22.04 on an Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz.

One more note:

"bzip3 -e -j 6 -b 50": 25 seconds, 125MB

So about as good as the best of xz, but in roughly a seventh of the time (25 vs. 183 seconds).

However: Do note that any unexpected use is met with a SIGSEGV: using as a filter, using "-j6" instead of "-j 6", not specifying "-e"...

lies. not specifying -e displays an error message:

  % bzip3 -e -j 6 -b 50 corpus/calgary.tar
  % bzip3 -j 6 -b 50 corpus/calgary.tar
  bzip3 - A better and stronger spiritual successor to bzip2.
  Copyright (C) by Kamila Szewczyk, 2022. Licensed under the terms of GPLv3.
  Usage: bzip3 [-e/-d/-t/-c] [-b block_size] input output
  Operations:
    -e: encode
    -d: decode
    -t: test
  Extra flags:
    -c: force reading/writing from standard streams
    -b N: set block size in MiB
    -j N: set the amount of parallel threads
you can use bzip3 as a filter:

  % cat corpus/calgary.tar | bzip3 -b 10 -e -c | wc -c
  807959
and using "-j6" is simply being unable to read the help page.
The code has, at https://github.com/kspalaiologos/bzip3/blob/bf2f0e02fd59f4c4... :

            } else if (argv[i][1] == 'j') {
                workers = atoi(argv[i + 1]);
                i++;
If the last argument is "-j6", then argv[i + 1] is argv[argc], which is NULL, so this ends up calling atoi(NULL):

  % ./bzip3 -j3 < README.md
  Segmentation fault
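For what it's worth, a bounds-checked version of that branch could look roughly like the sketch below. This is not bzip3's actual code: `parse_workers` is a hypothetical helper, and the 1..1024 range is an arbitrary assumption. It accepts both "-j 6" and "-j6" and reports an error instead of dereferencing argv[argc].

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not bzip3's actual code.
 * Returns the worker count given by "-j N" or "-jN",
 * 0 if no -j option is present, and -1 if the value is
 * missing, non-numeric, or outside an assumed 1..1024 range. */
static int parse_workers(int argc, char **argv) {
    assert(argv != NULL);
    int workers = 0;
    for (int i = 1; i < argc; i++) {
        if (argv[i][0] != '-' || argv[i][1] != 'j')
            continue;                     /* not the -j option */
        const char *value;
        if (argv[i][2] != '\0') {
            value = argv[i] + 2;          /* attached form: "-j6" */
        } else if (i + 1 < argc) {
            value = argv[++i];            /* separate form: "-j 6" */
        } else {
            return -1;                    /* "-j" was the last argument */
        }
        char *end;
        long n = strtol(value, &end, 10);
        if (end == value || *end != '\0' || n < 1 || n > 1024)
            return -1;                    /* reject junk instead of crashing */
        workers = (int)n;
    }
    return workers;
}
```

The key difference from the quoted snippet is the `i + 1 < argc` check before touching the next argument, plus strtol() so a non-numeric value is detected rather than silently read as 0.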
"-j6" is standard getopt() behavior, and the default expected behavior from Unix/POSIX systems.
They are not disputing that - they actually acknowledged it. Instead they are disputing that omitting -e will lead to a crash.
You're right, what I thought were encoding/filter SEGVs was actually the bug in handling "-j6".
If you ramp up the compression level on zstd, does it get smaller than bzip3 before it gets to the point of taking more time?
No, not even if you disregard time altogether. Compression of Calgary Corpus on a random old laptop:

             sec   KB
  gzip      0.19  1070
  zstd      0.02  1063
  zstd -19  1.47   897
  bzip2     0.35   891
  brotli    8.12   862
  xz        1.35   853
  bzip3     0.52   808
it's `bzip3 -e -j 6`. you need a space.
Can you also give decompression speeds?
"zstd -d": 1.03 seconds

"xz -d": 7.92 seconds

"bzip3 -d" as filter: SEGSEGV

"bzip3 -d linux-5.17.6.tar.bz3": 81 seconds

"bzip -d -j 6": 21.35 seconds (edit)

Lest I be called a liar again:

  $ ./bzip3-1.0.1/bzip3 -d <linux-5.17.6.tar.bz3 >/dev/null
  fish: Job 1, 'time ./bzip3-1.0.1/bzip3 -d <li…' terminated by signal SIGSEGV (Address boundary error)
Consider using `-c`, which makes the compressor use standard streams, or pull the main branch because I had just pushed a tiny patch that automatically enables it when no positional arguments are given.
No, but it would be nice to see visually where it sits on the size vs. (de-)compression speed Pareto front. Like this graphic (from the zstd homepage): https://raw.githubusercontent.com/facebook/zstd/master/doc/i...
This chart is from 2016. Both zstd and brotli are under active development, so I'd like to see a more recent comparison presented in this same format.
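As a rough stand-in for a chart, the Pareto front can be computed directly from the Calgary Corpus table posted upthread. The sketch below only uses those seven (compression time, size) pairs from a single run on one laptop; it says nothing about decompression speed, and the `struct result` and `pareto_front` names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct result { const char *name; double sec; int kb; };

/* Calgary Corpus figures exactly as posted in the thread. */
static const struct result calgary[] = {
    {"gzip",     0.19, 1070},
    {"zstd",     0.02, 1063},
    {"zstd -19", 1.47,  897},
    {"bzip2",    0.35,  891},
    {"brotli",   8.12,  862},
    {"xz",       1.35,  853},
    {"bzip3",    0.52,  808},
};

/* a dominates b if it is no worse on both axes and better on at least one. */
static int dominates(const struct result *a, const struct result *b) {
    return a->sec <= b->sec && a->kb <= b->kb &&
           (a->sec < b->sec || a->kb < b->kb);
}

/* Writes the names of non-dominated entries to names[] (which must have
 * room for n pointers) and returns how many there are. */
static size_t pareto_front(const struct result *r, size_t n,
                           const char **names) {
    assert(r != NULL && names != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int dominated = 0;
        for (size_t j = 0; j < n && !dominated; j++)
            if (j != i && dominates(&r[j], &r[i]))
                dominated = 1;
        if (!dominated)
            names[count++] = r[i].name;
    }
    return count;
}
```

On those numbers the front comes out as zstd, bzip2, and bzip3: every other entry is both slower and larger than at least one of them.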