by Fergi 2793 days ago
This only applies to continuous views, not all PG tables. Think of continuous views in PipelineDB as very high throughput, incrementally updated materialized views. Raw data hits continuous queries in PipelineDB (continuous views) and only the output of the continuous queries is stored. So 1 billion events ingested could be distilled down into a single row that incrementally counts up from 1 => 1 billion as each data point arrives, instead of storing all of the 1 billion raw data points and counting them up later.
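To make the "single row that counts up" idea concrete, here is a minimal sketch in Python. It is not PipelineDB code; the `ContinuousCount` class and its `ingest` method are hypothetical names used only to illustrate that each raw event updates the stored aggregate and is then thrown away.

```python
class ContinuousCount:
    """Hypothetical stand-in for a continuous view that keeps only a running count."""

    def __init__(self):
        self.count = 0  # the single stored "row"

    def ingest(self, event):
        # The raw event updates the aggregate and is not stored anywhere.
        self.count += 1


view = ContinuousCount()
for event in range(1_000_000):  # stand-in for a stream of raw events
    view.ingest(event)
```

Storage stays at one integer no matter how many events arrive, which is the whole point of aggregating at ingest time.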

You can't really do that with distinct: if you have 1 billion distinct entries, you essentially have to store all of them to dedup.
This is precisely why PipelineDB has rich support for data structures such as HyperLogLog [0]. HLLs let you track distinct counts in a fixed-size structure that only grows to about 14KB while encoding unique counts for billions of distinct values. The tradeoff is a margin of error of about 0.8%, which users generally find acceptable.
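For anyone who hasn't seen one, here is a rough textbook HyperLogLog in Python. It is a sketch, not PipelineDB's implementation: the class name, the SHA-1 based hash, and the precision `p=14` are all assumptions made for illustration. With `p=14` there are 2^14 = 16384 registers, and the usual standard error of 1.04/sqrt(16384) works out to roughly 0.8%.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog sketch (illustrative only, p=14 is an assumption)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m    # fixed size, independent of input size

    def add(self, value):
        # Derive a 64-bit hash from the value.
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64-p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise max yields the sketch of the union of both inputs.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        z = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # an estimate close to 100000, not an exact count
```

The register array never grows, so memory is the same for a thousand values or a billion. The `merge` method is what makes sketches combinable: taking the max of each register pair gives exactly the sketch you would have built from the union of the two inputs.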

Furthermore, PipelineDB has a special combine [1] aggregate that allows you to combine data structures such as HLL across multiple rows with no loss of information. A simpler example would be average: to get the actual average of multiple averages you obviously can't simply take the average of all the averages. Their weights must be taken into account, and combine handles that.
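The average case is easy to show in a few lines of Python. This is only a sketch of the general idea (keep a partial state of sum and count, then combine states), not PipelineDB's internals; `AvgState` and `combine` here are hypothetical names, not the PipelineDB API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Partial state for an average: enough to combine without losing information."""
    total: float
    count: int

    @classmethod
    def from_values(cls, values):
        return cls(total=sum(values), count=len(values))

    @property
    def value(self):
        return self.total / self.count


def combine(states):
    # Add the sums and the counts; the weights are carried by the counts.
    return AvgState(
        total=sum(s.total for s in states),
        count=sum(s.count for s in states),
    )


row_a = AvgState.from_values([10, 20])   # average 15 over 2 values
row_b = AvgState.from_values([30])       # average 30 over 1 value

naive = (row_a.value + row_b.value) / 2  # 22.5, ignores the weights
correct = combine([row_a, row_b]).value  # 20.0, same as averaging [10, 20, 30]
```

The naive average of averages gives 22.5, while combining the underlying states gives 20.0, which matches averaging the three raw values directly.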

The capability to combine aggregate values in this way generalizes to all aggregates in PipelineDB.

[0] http://docs.pipelinedb.com/aggregates.html#hyperloglog-aggre...

[1] http://docs.pipelinedb.com/aggregates.html#combine