by ohnoesjmr 2792 days ago
You can't really do that with distinct: if you have 1 billion distinct entries, you essentially have to store all of them to dedup.

This is precisely why PipelineDB has rich support for data structures such as HyperLogLog [0]. HLLs let you track distinct counts in a fixed-size structure that only grows to about 14KB while encoding unique counts for billions of distinct values. The tradeoff is a margin of error of roughly 0.8%, which users generally find acceptable.
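To make the fixed-size idea concrete, here is a minimal HyperLogLog sketch in Python. It is an illustration only, not PipelineDB's implementation: the class name, the choice of SHA-1 as the hash, and storing one byte per register are all assumptions made for readability.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (hypothetical helper, not PipelineDB code)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                    # number of registers
        self.registers = bytearray(self.m)  # one byte per register, for simplicity

    def add(self, value):
        # Take a 64-bit hash of the value.
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


if __name__ == "__main__":
    a, b = HyperLogLog(), HyperLogLog()
    for i in range(120_000):
        a.add(f"user-{i}")
    for i in range(80_000, 200_000):
        b.add(f"user-{i}")
    # 200,000 distinct values overall; the estimate should land close to that.
    print(a.merge(b).count())
```

The sketch never grows no matter how many values are added, and adding a value twice changes nothing. Two sketches can also be merged without revisiting the raw data.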

Furthermore, PipelineDB has a special combine [1] aggregate that lets you merge data structures such as HLLs across multiple rows with no loss of information. A simpler example is average: to get the true average from several per-group averages, you can't simply average the averages. Each one must be weighted by its row count, and combine handles that.
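The averaging case can be sketched in a few lines of Python. This is only an illustration of the general idea of keeping a combinable intermediate state (a sum and a count) instead of the finished average. The AvgState name and its methods are hypothetical and say nothing about how PipelineDB stores aggregates internally.

```python
from dataclasses import dataclass


@dataclass
class AvgState:
    """Combinable state for an average (hypothetical helper)."""
    total: float = 0.0
    count: int = 0

    def add(self, x):
        self.total += x
        self.count += 1

    def combine(self, other):
        # Merging two states keeps the weights: sums add, counts add.
        return AvgState(self.total + other.total, self.count + other.count)

    def result(self):
        return self.total / self.count


def state_of(values):
    state = AvgState()
    for v in values:
        state.add(v)
    return state


if __name__ == "__main__":
    row_a = state_of([10, 20, 30])   # average 20 over 3 values
    row_b = state_of([100])          # average 100 over 1 value

    naive = (row_a.result() + row_b.result()) / 2
    combined = row_a.combine(row_b).result()
    print(naive)     # average of averages, ignores weights
    print(combined)  # true average over all 4 values
```

The naive number treats the single-value row as if it carried as much weight as the three-value row. Combining the states gives the same answer as averaging all four values directly.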

The capability to combine aggregate values in this way generalizes to all aggregates in PipelineDB.

[0] http://docs.pipelinedb.com/aggregates.html#hyperloglog-aggre...

[1] http://docs.pipelinedb.com/aggregates.html#combine