Hacker News new | ask | show | jobs
by millisecond 4417 days ago
HLL also has two nice real-world optimizations possible depending on use-case.

We're storing 100,000+ unique counters, but only around 1% have more than 100 unique objects counted. Some of those 1% have millions of records so HLL is very useful. As the HLL itself is a fixed size (~10kb for decent accuracy) regardless of #counted objects, in the small case you can replace the HLL with a pure set of counted values and produce a HLL when it grows beyond a bound. Because you're storing the raw values, the transition to HLL is seamless.

Once you've moved beyond raw storage of values there's a harder but still space-saving technique. If you look at the raw bytes of a ~10kb HLL structure with "only" 10's of thousands of counted values around 90% of them will be zero. Below a certain bound it can save a lot of space to have a map of locations and non-zero byte values rather than a raw array of bytes.

1 comments

One thing people forget about in all the excitement over HLL's is how effectiveness of compressed bitsets, which aren't lossy and so yield precise answers. They exploit the same "90% of them will be zero" phenomenon for space and execution efficiency, but are much more flexible... in exchange for consuming more memory and being slower than HLL's.