Hacker News
by rwaksmunski 360 days ago
I use this crate to process hundreds of TB of Common Crawl data, and I appreciate the speedups.
3 comments

What's the reason for using bz2 here? Wouldn't it be faster to do a one-off conversion to zstd? As far as I know, it beats bzip2 on every metric at higher compression levels.
Common Crawl delivers the data as bz2. Indeed I store intermediate data in zstd with ZFS.
That assumes you're processing the data more than once.
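For anyone curious what the one-off conversion discussed above might look like, here is a rough sketch of a streaming bz2-to-something-else transcode. It is not what anyone in this thread actually runs. The function name `transcode_bz2` and the chunk size are made up for illustration, and `zlib` stands in for zstd only so the sketch runs on the standard library alone; any compressor object with `compress()` and `flush()` methods should slot in.

```python
import bz2
import io
import zlib

CHUNK = 1 << 20  # 1 MiB read size; an arbitrary choice for this sketch


def transcode_bz2(src, dst, compressor):
    """Stream-decompress bz2 data from src and recompress it into dst.

    `compressor` is any object with compress(bytes) and flush() methods.
    Returns the number of uncompressed bytes processed.
    """
    total = 0
    # BZ2File reads lazily, so the full payload is never held in memory.
    with bz2.BZ2File(src) as reader:
        while True:
            chunk = reader.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
            dst.write(compressor.compress(chunk))
    # Emit whatever the compressor is still buffering.
    dst.write(compressor.flush())
    return total


# Synthetic stand-in data, not real Common Crawl content.
original = b"common crawl record\n" * 50_000
src = io.BytesIO(bz2.compress(original))
dst = io.BytesIO()

n = transcode_bz2(src, dst, zlib.compressobj(6))
print(f"transcoded {n} bytes into {len(dst.getvalue())} compressed bytes")
```

The conversion still pays the full bzip2 decompression cost once, which is why it only helps if the data gets read again afterwards.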
Is this data available as torrents?
Yeah, came here to say a 14% speedup in compression is pretty good!
bzip2 (particularly its parallel implementations) is already relatively competitive for compression. Decompression time is where it lags behind, because LZ77-based algorithms can be incredibly fast at decompression.
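A quick way to check the decompression claim on your own machine is a small timing sketch like the one below. It uses `zlib` (DEFLATE, which is LZ77-based) as the comparison point because it ships with Python; zstd would be the more relevant comparison but is left out here as it needs a separate binding on most Python versions. The helper `time_decompress` and the synthetic payload are assumptions for illustration, and the numbers will vary with data and hardware.

```python
import bz2
import time
import zlib


def time_decompress(decompress, blob, repeats=3):
    """Return the best wall-clock time (seconds) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic, fairly repetitive text; real corpora will behave differently.
payload = b"".join(b"line %d of some repetitive text\n" % i for i in range(200_000))

bz_blob = bz2.compress(payload, 9)
z_blob = zlib.compress(payload, 9)

bz_time = time_decompress(bz2.decompress, bz_blob)
z_time = time_decompress(zlib.decompress, z_blob)

print(f"bz2:  {len(bz_blob):>9} bytes, decompress {bz_time * 1000:.1f} ms")
print(f"zlib: {len(z_blob):>9} bytes, decompress {z_time * 1000:.1f} ms")
```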
It's blazingly fast