Hacker News new | ask | show | jobs
by dragontamer 801 days ago
I'm not exceptionally good with compression, but I did a few experiments.

I've noticed that array-of-structs form is harder to compress than struct-of-arrays form. This is relevant because a floating-point number is really a struct containing 3 values: 1-bit sign, 8-bit exponent, and 23-bit mantissa.

--------

If you have 1000 floats, I would expect 1000-sign bits in order to be more compressible (whatever compression algorithm you use, it would better determine the pattern of signs).

Then 8000-bits of exponent would also be more compressible, because all the exponents would likely follow its own pattern.

Finally, the 23-bits of mantissa would probably be difficult to impossible for compression. But "extracting it out" before running a compression algorithm will give the sign + exponents more opportunity for comrpession.

--------

So yeah, just think about your data. And favor structure-of-arrays form for better compression.

1 comments

Totally, SoAs are often easier to compress. I hadn’t thought about floats as being AoSs, and then unpacking the sign, exponent and mantissas of a batch of float into separate arrays, that’s an interesting way to look at it. You could probably take that 1 step further with floats or ints, and treat an array of numbers as an array of structures of 32 or 64 fields where every field is 1 bit. So to compress a batch of 1000 32-bit numbers, take all bit 31s in a row, then all bit 30s, etc.. Or another way to look at it is to stack the numbers in a column, transpose it, and read out the bits.

It’s worth noting that the algorithm in the article, and it’s cousin, delta encoding, both do already take advantage of any cases where the sign & exponent bits don’t change with every sample, and/or where the mantissa ULPs are constant for a while at a time. It’d be interesting to compare the explicit SoA of floats compression idea to XOR or delta compression, but I think it’s also interesting in a meta kinda of way that these two seemingly different ideas have similar outcomes.