by palaiologos 1504 days ago

You've literally tested it on a single file, enwik8. That's not enough to extrapolate valuable results. One of the benchmarks:

  time ./bsc e ../linux.tar linux.bsc -e2 -b16 -T
  68.69s user 1.14s system 99% cpu 117M memory 1:09.84 total

By comparison, bzip3 uses 98M and takes 1min 17s to produce a 129023171-byte file, versus 127747834 bytes from BSC. They're very similar, except that bzip3 tends to use less memory and decompresses a little slower. BSC is much more mature than bzip3 though, and the benchmarks might be subject to change some time in the future. Surprisingly, the BSC code isn't really that robust (I reported a UB bug to libsais and had to pretty much rework the LZP code because it couldn't stand fuzzing).
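
To put those figures side by side, here is a quick back-of-the-envelope calculation. It is only a sketch that uses the numbers quoted in this comment (linux.tar, one run each); the variable names are illustrative, and the bzip3 time is taken as exactly 77 seconds from "1min 17s".

```python
# Figures quoted in the comment above (linux.tar, single run each).
bsc_bytes = 127_747_834
bzip3_bytes = 129_023_171
bsc_seconds = 69.84    # "1:09.84 total" from the bsc run
bzip3_seconds = 77.0   # "1min 17s", assumed to be exactly 77 s

# Absolute and relative size difference, with bsc as the baseline.
size_diff = bzip3_bytes - bsc_bytes
size_pct = 100 * size_diff / bsc_bytes

# Relative wall-clock difference, with bsc as the baseline.
time_pct = 100 * (bzip3_seconds - bsc_seconds) / bsc_seconds

print(f"bzip3 output is {size_diff} bytes ({size_pct:.2f}%) larger than bsc")
print(f"bzip3 took {time_pct:.1f}% longer than bsc")
```

On these numbers the size gap is roughly one percent, which fits the "very similar" reading, though a single file and a single run is not much to go on.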

jkbonfield 1504 days ago

Well yes it was one file, but it was stated as being good on text and enwik8 is a pretty standard test corpus for text compressors.

I could have done more, but it somewhat vindicated what I was saying really. It has a very similar core to bsc (based on the same code) and gives very similar file sizes as expected. Note you may wish to use bsc -tT to disable both forms of threading. I don't know if that changes memory usage any.

Have you tried making PRs back to the libbsc GitHub repository to fix the UB and fuzzing issues? I'm sure the author would welcome fixes given you've already done the legwork.

Anyway, please do consider benchmarking against libbsc. It's conspicuously absent given the shared ancestry.

palaiologos 1504 days ago

I haven't figured out a libsais fix, and my LZP fix changes the functionality a little (it removes chunking for better compression at a rather small runtime cost), so I don't think the author would like me to submit it. I have opened tickets, though.