by wmertens 1014 days ago
> CONCLUSION

> We introduced BtrBlocks, an open columnar compression format for data lakes. By analyzing a collection of real-world datasets, we selected a pool of fast encoding schemes for this use case. Additionally, we introduced Pseudodecimal Encoding, a novel compression scheme for floating-point numbers. Using our sample-based compression scheme selection algorithm and our generic framework for cascading compression, we showed that, compared to existing data lake formats, BtrBlocks achieves a high compression factor, competitive compression speed and superior decompression performance. BtrBlocks is open source and available at https://github.com/maxi-k/btrblocks.

1 comment

It is interesting, and I'd love to look over some detailed benchmarks on the differences. Storing floats as integers overcomes several of the challenges floats pose. The example of dollar units would be a good candidate for a short delta compression.
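To make the dollar example concrete, here is a rough Python sketch of the idea: keep prices as integer cents, then delta-encode them. This is only an illustration of the general technique, not BtrBlocks' Pseudodecimal Encoding or anything taken from its code. The function names and the sample prices are made up for the example.

```python
from decimal import Decimal

def to_cents(prices):
    """Convert dollar amounts given as decimal strings to integer cents.

    Decimal is used so that values like "19.99" convert exactly,
    without binary floating-point rounding.
    """
    return [int(Decimal(p) * 100) for p in prices]

def delta_encode(values):
    """Keep the first value, then store each value as the difference
    from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Hypothetical sample column of prices in dollars.
prices = ["19.99", "20.49", "20.49", "21.00"]
cents = to_cents(prices)
deltas = delta_encode(cents)
```

When neighboring values are close, the deltas after the first entry are small integers, which a bit-packing or variable-length integer scheme can then store in fewer bits than the original values. That only pays off if the column order keeps similar values adjacent.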

I doubt I'd ever use columnar compression again, as I found it too difficult to fight DBAs on keeping the original sort order and schema preserved in an optimal way. I do find it really interesting though.