Hacker News
by jwilliams 2984 days ago
I send a fair amount of data to Cloud Storage. The volume varies a lot: usually ~10GB/day, but it regularly reaches 1TB/day.

xz can be amazing. It can also bite you.

I've had payloads that compress to 0.16 of their original size with gzip and to 0.016 with xz. Hurray! Then I've had payloads where xz compression is on par, or worse. However, with "best or extreme" compression, xz can peg your CPU for much longer: gzip and bzip2 take minutes, while xz -9 takes hours at 100% CPU.

As annoying as that is, an order of magnitude better compression in many circumstances is hard to give up.

My compromise is "xz -1". It usually delivers pretty good results, in reasonable time, with manageable CPU/Memory usage.

FYI, the datasets are largely text-ish, usually in 250MB-1GB chunks: JSON data, webpages, and the like.
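
For anyone wanting to check the gzip vs. xz trade-off on their own payloads, here is a minimal sketch using only Python's standard library. The sample data generator (`make_sample`) and the helper names are hypothetical stand-ins, not the poster's actual setup; the presets roughly correspond to `gzip -6`, `xz -1`, and `xz -9e`. Ratios and timings will differ a lot depending on the real data.

```python
import gzip
import json
import lzma
import time


def make_sample(n_records=20000):
    """Build a repetitive JSON payload (hypothetical stand-in for real data)."""
    records = [
        {"id": i, "url": f"https://example.com/page/{i % 50}", "status": "ok"}
        for i in range(n_records)
    ]
    return json.dumps(records).encode("utf-8")


def measure(name, compress, data):
    """Time one compressor and report compressed size / original size."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return {"name": name, "ratio": len(out) / len(data), "seconds": elapsed}


def compare(data):
    """Run gzip and two xz presets over the same payload."""
    candidates = [
        ("gzip -6", lambda d: gzip.compress(d, compresslevel=6)),
        ("xz -1", lambda d: lzma.compress(d, preset=1)),
        ("xz -9e", lambda d: lzma.compress(d, preset=9 | lzma.PRESET_EXTREME)),
    ]
    return [measure(name, fn, data) for name, fn in candidates]


if __name__ == "__main__":
    for row in compare(make_sample()):
        print(f"{row['name']:8s} ratio={row['ratio']:.4f} time={row['seconds']:.3f}s")
```

Running this against a representative chunk of your own data, rather than the synthetic sample, is the only way to tell whether the extra CPU time of the higher presets pays off.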

2 comments

If you can compress data this much, you seem to have a lot of repetitive data. Have you tried using compression algorithms that support custom dictionaries? ZSTD and DEFLATE support those and can maybe help with compression ratio as well as speed.
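
A minimal sketch of the custom-dictionary idea, using DEFLATE through Python's standard `zlib` module (Zstandard also supports dictionaries, but is not shown here). The record and the dictionary contents are made-up examples; in practice the dictionary would be built from strings that recur across your real payloads, and the same dictionary must be available at decompression time.

```python
import zlib

# Hypothetical dictionary: byte strings expected to recur in every record.
SHARED_DICT = b'{"id": , "url": "https://example.com/page/", "status": "ok"}'


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """DEFLATE-compress data, priming the compressor with a preset dictionary."""
    compressor = zlib.compressobj(level=6, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Reverse compress_with_dict; requires the identical dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


if __name__ == "__main__":
    record = b'{"id": 7, "url": "https://example.com/page/7", "status": "ok"}'
    plain = zlib.compress(record, 6)
    primed = compress_with_dict(record, SHARED_DICT)
    print(len(record), len(plain), len(primed))
    assert decompress_with_dict(primed, SHARED_DICT) == record
```

Dictionaries mostly help with many small, similar payloads; for 250MB-1GB chunks the compressor already sees most of the repetition within the chunk itself, so the gain may be small.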
If you get compression ratios that good, you should consider whether your application might be doing something stupid like storing the same data thousands of times inside its data file.

If you store enough of the same type of data, invest in redesigning the application. There's a reason we all use jpegs over zipped bitmaps...

> There's a reason we all use jpegs over zipped bitmaps...

It's because it's an appropriate compression - just like xz can be? Not sure what you're actually suggesting here.

The suggestion is to design an application-specific format that avoids storing redundant data in the first place. When that's an option at all it gives you higher compression than any general-purpose compression algorithm can achieve.
HTML is pretty repetitive, but if you want to archive HTML data, you don't get to redefine what HTML is. Compression is useful.
This is what the WARC [0] file format (and/or gzip) is for.

[0] https://en.m.wikipedia.org/wiki/Web_ARChive

And/or xz? Because xz gives better compression than gzip or WARC?
It sounds like his application is scraping data of some kind rather than, say, generating it.