Hacker News
by londons_explore 2986 days ago
If you get compression ratios that good, you should consider whether your application might be doing something stupid, like storing the same data thousands of times inside its data file.

If you store enough of the same type of data, invest in redesigning the application. There's a reason we all use jpegs over zipped bitmaps...
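
To make the point above concrete, here is a rough sketch of how a suspiciously high compression ratio can point at duplicated records, and how storing each distinct record once removes the redundancy before any compressor runs. The record contents, the count of 5,000 copies, and the helper names `compression_ratio` and `dedupe` are all made up for illustration; nothing here is taken from the application being discussed.

```python
import hashlib
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


def dedupe(records: list[bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Store each distinct record once, keyed by its SHA-256 digest.

    Returns the unique-record store plus the ordered list of digests
    needed to rebuild the original sequence.
    """
    store: dict[str, bytes] = {}
    order: list[str] = []
    for record in records:
        key = hashlib.sha256(record).hexdigest()
        store.setdefault(key, record)  # keep only the first copy
        order.append(key)
    return store, order


# Hypothetical data file: the same 1 KiB record written 5,000 times.
record = bytes(range(256)) * 4
records = [record] * 5000
naive_file = b"".join(records)

store, order = dedupe(records)
rebuilt = b"".join(store[key] for key in order)

print(f"naive file: {len(naive_file)} bytes, "
      f"zlib ratio {compression_ratio(naive_file):.0f}:1")
print(f"unique records after dedupe: {len(store)}")
```

The ordered digest list still costs space, so a real format would likely use short integer IDs instead of hex digests; the sketch only shows the idea of not writing the same bytes twice.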

3 comments

> There's a reason we all use jpegs over zipped bitmaps...

It's because it's an appropriate compression - just like xz can be? Not sure what you're actually suggesting here.

The suggestion is to design an application-specific format that avoids storing redundant data in the first place. When that's an option at all it gives you higher compression than any general-purpose compression algorithm can achieve.

HTML is pretty repetitive, but if you want to archive HTML data, you don't get to redefine what HTML is. Compression is useful.

This is what the WARC [0] file format (and/or gzip) is for.

[0] https://en.m.wikipedia.org/wiki/Web_ARChive

And/or xz? Because xz gives better compression than gzip or WARC?

It sounds like his application is scraping data of some kind rather than, say, generating it.
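
On the gzip versus xz question above, one way to settle it for a given data set is simply to measure. The sketch below uses Python's standard `gzip` and `lzma` modules on a made-up, repetitive HTML page (the table rows are a stand-in, not real scraped data). Note that WARC is a container format rather than a compressor, so the comparison is only between gzip and xz. Results depend heavily on the input, so no sizes are claimed here.

```python
import gzip
import lzma

# Hypothetical stand-in for a scraped page: 2,000 rows sharing the same markup.
rows = "".join(
    f'<tr class="item"><td>{i}</td><td><a href="/item/{i}">Item {i}</a></td></tr>\n'
    for i in range(2000)
)
html = f"<html><body><table>\n{rows}</table></body></html>\n".encode("utf-8")

gz = gzip.compress(html, compresslevel=9)
xz = lzma.compress(html, format=lzma.FORMAT_XZ, preset=9)

# Both formats are lossless, so the HTML comes back byte for byte.
assert gzip.decompress(gz) == html
assert lzma.decompress(xz) == html

for name, blob in (("raw", html), ("gzip", gz), ("xz", xz)):
    print(f"{name:>4}: {len(blob):>7} bytes")
```

Swapping in a real sample of the archived pages for `html` would give numbers that actually apply to the case being discussed.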