Hacker News new | ask | show | jobs
by terrelln 759 days ago
Variable-length integer schemes are great when you are encoding one integer. But when you are encoding a list of integers, you should really consider a different scheme.

Variable-length integer schemes generally interleave the control bits & data bits. This means you don't know where the next integer starts until you at least partially decode the current integer.

When encoding a list of integers, you would rather put all the control information together, and all the data together. This generally allows for significantly more efficient decoding using SIMD.

2 comments

Don’t forget column-stores. Lists of floats and ints are generally “full size” but then XORed with the preceding value then the result is compressed in chunks with a high-speed compressor like LZ4. We see 20x or higher compression ratios on many of our column-stores which contain data such as invoice amounts as metrics. Decompression overhead is below zero (much faster than operating on the raw uncompressed data which is bound by memory bandwidth or IO)
Yeah, see Stream VByte [1] for more information.

[1] https://arxiv.org/abs/1709.08990