by atomicity 1881 days ago
If you have PBs of data that you rarely access, it seems to make sense to compress it first.

I've rarely seen any non-giants with PBs of data properly compressed. For example, small JSON files converted into larger, compressed Parquet files can use 10-100x less space. I am not familiar with images, but I see no reason why encoding batches of similar images should make it hard to get similar or even better compression ratios.
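To illustrate the batching point, here is a minimal sketch using only the Python standard library. It swaps in gzip over newline-delimited JSON instead of Parquet, and the record fields are made up for the example, so treat the numbers it prints as illustrative only and not as a reproduction of the 10-100x figure.

```python
import gzip
import json
import random

rng = random.Random(0)

# Hypothetical small JSON records that all share the same keys.
records = [
    {
        "user_id": i,
        "event": rng.choice(["click", "view", "purchase"]),
        "timestamp": 1_600_000_000 + i,
        "country": rng.choice(["US", "DE", "JP"]),
    }
    for i in range(10_000)
]
encoded = [json.dumps(r).encode("utf-8") for r in records]

# Uncompressed total.
raw_size = sum(len(e) for e in encoded)

# One gzip stream per record: each tiny file pays the gzip header cost
# and the compressor never sees the repetition across records.
per_file_size = sum(len(gzip.compress(e)) for e in encoded)

# One gzip stream over all records (newline-delimited JSON).
batched_size = len(gzip.compress(b"\n".join(encoded)))

print(f"raw:      {raw_size} bytes")
print(f"per-file: {per_file_size} bytes")
print(f"batched:  {batched_size} bytes")
```

A columnar format like Parquet would likely do better than this on real data, since it groups values of the same field together before compressing them.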

Also, if you decide to move off later on, your transfer costs will be lower if you can move the data off in compressed form.


Could be wrong, but I don't believe batches of already-compressed images compress well.

But I'd be very interested to hear about techniques for this, because I have a lot of space eaten up by timelapses myself.

It's not about space reduction, it's about handling the small file problem. HDFS can handle up to 500M files without issue but the amount of RAM needed to store the files' metadata starts to go beyond what you'd typically find in a single server these days.
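A back-of-the-envelope sketch of the RAM issue, assuming the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block). That constant, the one-block-per-small-file assumption, and the packing ratio below are assumptions for illustration, not measurements.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_heap_gib(num_files, blocks_per_file=1):
    """Estimate NameNode heap for file and block metadata, in GiB."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT / 2**30


# 500M small files, each fitting in a single block.
small_files = namenode_heap_gib(500_000_000)

# Hypothetical: the same data packed 1,000 images per container file,
# with each container spanning 4 blocks.
packed_files = namenode_heap_gib(500_000, blocks_per_file=4)

print(f"500M small files:   {small_files:.1f} GiB of heap")
print(f"500K packed files:  {packed_files:.2f} GiB of heap")
```

Under these assumptions the unpacked layout needs on the order of 140 GiB of heap for metadata alone, which is the kind of number that stops fitting in a single server.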

When you store multiple images and/or videos inside of a single PQ file, you'll end up keeping fewer files on your server.

I believe Uber store JPEG data in PQ files and Spotify store audio files in PQ or a similar format on their backend.

On the contrary, batches of images with a high degree of similarity compress _very_ well. You have to use an algorithm specifically designed for that task, though. Video codecs are a real-world example: consider that H.265 is really compressing a stream of (potentially) completely independent frames under the hood.
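A toy sketch of why similarity helps, assuming raw (not already-compressed) pixel data. It is not how H.265 works: real codecs use motion compensation and transforms, while this just XORs each frame against the previous one and runs zlib on the result. The frames are synthetic noise with a few bytes changed per step, so every name and size here is made up for illustration.

```python
import random
import zlib

FRAME_SIZE = 64 * 1024  # bytes per synthetic frame


def make_frames(count, changed_bytes=64, seed=0):
    """Build noisy frames where each one differs slightly from the last."""
    rng = random.Random(seed)
    frame = bytearray(rng.getrandbits(8) for _ in range(FRAME_SIZE))
    frames = [bytes(frame)]
    for _ in range(count - 1):
        for _ in range(changed_bytes):
            frame[rng.randrange(FRAME_SIZE)] = rng.getrandbits(8)
        frames.append(bytes(frame))
    return frames


def encode(frames):
    """Compress the first frame as-is, then only the XOR delta of each later frame."""
    blobs = [zlib.compress(frames[0], 9)]
    for prev, cur in zip(frames, frames[1:]):
        delta = bytes(a ^ b for a, b in zip(prev, cur))
        blobs.append(zlib.compress(delta, 9))
    return blobs


def decode(blobs):
    """Rebuild every frame by applying each delta to the previous frame."""
    frames = [zlib.decompress(blobs[0])]
    for blob in blobs[1:]:
        delta = zlib.decompress(blob)
        frames.append(bytes(a ^ b for a, b in zip(frames[-1], delta)))
    return frames


frames = make_frames(8)
blobs = encode(frames)
independent_total = sum(len(zlib.compress(f, 9)) for f in frames)
delta_total = sum(len(b) for b in blobs)

print(f"independent: {independent_total} bytes")
print(f"delta-coded: {delta_total} bytes")
print("lossless round trip:", decode(blobs) == frames)
```

The catch is that reading frame N means decoding every frame before it, which is why video codecs insert periodic keyframes.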

I'm not sure what the state of lossless algorithms might be for that though.

Best I know of for that is something like lrzip still, but even then it's probably not state of the art. https://github.com/ckolivas/lrzip

It'll also take a hell of a long time to do the compression and decompression. It'd probably be better to do some kind of chunking and deduplication instead of compression itself, simply because I don't think you're ever going to have enough RAM to store any kind of dictionary that would effectively handle so much data. You also wouldn't want to have to re-read and reconstruct that dictionary just to get at some random image.
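A minimal sketch of the chunking and deduplication idea, assuming fixed-size chunks keyed by SHA-256 and an in-memory dict standing in for the chunk store. Real tools often use content-defined chunking so that insertions don't shift every chunk boundary; fixed-size chunks are used here only to keep the example short, and all names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # bytes; an arbitrary choice for this sketch


def store(data, chunk_store):
    """Split data into chunks, keep each unique chunk once, return the recipe."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).digest()
        chunk_store.setdefault(digest, chunk)  # only stored if not seen before
        recipe.append(digest)
    return recipe


def load(recipe, chunk_store):
    """Rebuild a file from its recipe; only its own chunks are touched."""
    return b"".join(chunk_store[digest] for digest in recipe)


def block(n):
    """Deterministic 4096-byte filler block used as fake file content."""
    return hashlib.sha256(str(n).encode()).digest() * 128


# Two hypothetical files that share 9 of their 10 chunks.
file_a = b"".join(block(i) for i in range(10))
file_b = b"".join(block(99) if i == 5 else block(i) for i in range(10))

chunk_store = {}
recipe_a = store(file_a, chunk_store)
recipe_b = store(file_b, chunk_store)

stored = sum(len(c) for c in chunk_store.values())
print(f"logical size: {len(file_a) + len(file_b)} bytes")
print(f"stored size:  {stored} bytes in {len(chunk_store)} unique chunks")
```

Fetching one file only needs its recipe and the chunks it points to, so there is no global dictionary to rebuild for random access.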

A movie is a series of similar images, which allows temporal compression along a third axis in addition to the 2D raster. H.265 is about as good as it gets at the moment, but it's also lossy, which might not be tolerable.

H.266 (VVC) looks impressive. Waiting to get my hands on an FPGA codec for testing.

Right, but we're not talking about compressing a video stream, we're talking about compressing individually compressed pictures. Big difference.