|
|
|
|
|
by oertl
3277 days ago
|
|
The tailcut approach breaks one important property of cardinality estimation algorithms: Adding the same value multiple times should change the state of the sketch only once when adding it for the first time. However, due to the chance of overflows the same value can lead to multiple changes when using the tailcut approach. The introduced error depends on the order and also on the frequency distribution of inserted elements. It would be interesting to know which assumptions on the input data were made for the promised accuracy. In their paper I could not find any hint how the input data looked like in their experiments. It is fairly easy to obtain good results for the trivial case where all input values are distinct. The tailcut approach reminds me of the HyperBitBit algorithm which also drops the property described above and promises a better storage factor (less error with same amount of memory). It is true that the traditional 5-bit or 6-bit representations of Hyperloglog registers are suboptimal. Even with lossless compression (see https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b...) a significant amount of memory can be saved. It is interesting that the standard error is reduced from 1.04/sqrt(m) down to 1.0/sqrt(m) despite the loss of information after the tailcut. Therefore, I conclude that it must be the estimation method itself which is superior to existing methods. I will need to check. |
|