Hacker News new | ask | show | jobs
by oertl 3277 days ago
The tailcut approach breaks one important property of cardinality estimation algorithms: Adding the same value multiple times should change the state of the sketch only once when adding it for the first time.

However, due to the chance of overflows the same value can lead to multiple changes when using the tailcut approach. The introduced error depends on the order and also on the frequency distribution of inserted elements. It would be interesting to know which assumptions on the input data were made for the promised accuracy. In their paper I could not find any hint how the input data looked like in their experiments. It is fairly easy to obtain good results for the trivial case where all input values are distinct.

The tailcut approach reminds me of the HyperBitBit algorithm which also drops the property described above and promises a better storage factor (less error with same amount of memory).

It is true that the traditional 5-bit or 6-bit representations of Hyperloglog registers are suboptimal. Even with lossless compression (see https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b...) a significant amount of memory can be saved.

It is interesting that the standard error is reduced from 1.04/sqrt(m) down to 1.0/sqrt(m) despite the loss of information after the tailcut. Therefore, I conclude that it must be the estimation method itself which is superior to existing methods. I will need to check.

2 comments

Hey oertl, big fan of your paper https://arxiv.org/pdf/1706.07290.pdf :D I going to start working on "intersections" and some improvements of the large scale cardinalities based on your research. Can I hit you up an email with some of my questions (concerning the intersection work? how does it compare to https://arxiv.org/pdf/1704.03911.pdf)
Sure, send me an email.
So if a register overflows (and is truncated) "twice" by the same element before the base register (B) is increased, then after B is increased, the overflowing register is not increased and so appears not to have overflowed. On the other hand, if that element shows up in the stream once before and once after B is increased, the overflowing register may overflow again.

This makes even the MLE procedure partially incorrect in the first case, since it is the maximum likelihood estimation of a set of hashes in which the register discussed above did not overflow.

Is that about right?

I have not yet checked their ML approach. Regarding register updates you are right.

I have also seen that they did only 10000 simulation runs to determine the standard deviation to be 1.00/sqrt(m). I think more simulations would be necessary to get a confidence interval for the standard deviation that is small enough to allow differentiation from 1.04/sqrt(m).