by crescentfresh 2795 days ago
Looking over this cursorily, looks super cool.

    INSERT INTO events_stream (ts, value) VALUES (now(), '0ef346ac');
> As soon as the continuous view reads new incoming events and the distinct count is updated to reflect new information, the raw events will be discarded.

So you create a table, insert into it, and it's always empty. Is that right?

Does this work for any table in pg? How does pg know that the insert should NOT actually insert a row?


This only applies to continuous views, not all PG tables. Think of continuous views in PipelineDB as very high-throughput, incrementally updated materialized views. Raw data hits continuous queries (continuous views), and only the output of those queries is stored. So 1 billion ingested events could be distilled down into a single row that counts up incrementally from 1 to 1 billion as each data point arrives, instead of storing all 1 billion raw data points and counting them up later.
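
To make the "single row that counts up" idea concrete, here is a minimal Python sketch. It is only an illustration of incremental aggregation, not PipelineDB code: the `ContinuousCount` class and its `ingest` method are hypothetical names invented for this example.

```python
class ContinuousCount:
    """Toy stand-in for a continuous view that keeps only a running count.

    Hypothetical illustration; this is not how PipelineDB is implemented.
    """

    def __init__(self):
        self.count = 0  # the single stored "row"

    def ingest(self, event):
        # Fold the event into the aggregate. The event itself is never
        # stored, so memory use stays constant however many events arrive.
        self.count += 1


view = ContinuousCount()
for i in range(100_000):
    view.ingest({"ts": i, "value": "0ef346ac"})

print(view.count)
```

The state is one integer regardless of how many events were ingested, which is the whole point of aggregating on arrival rather than storing and counting later.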
You can't really do that with distinct: if you have 1 billion distinct entries, you essentially have to store all of them to dedup.
This is precisely why PipelineDB has rich support for data structures such as HyperLogLog [0]. HLLs let you track distinct counts in a fixed-size structure that only grows to about 14KB while encoding unique counts for billions of distinct values. The tradeoff is a margin of error of about 0.8%, which users generally find acceptable.
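
For anyone who hasn't seen HyperLogLog before, a rough Python sketch of the textbook algorithm follows. It is not PipelineDB's implementation. The register count (2^14), the SHA-1 hash, and the one-byte-per-register layout (16 KiB here) are all assumptions made for this example. The standard error of plain HLL is about 1.04/sqrt(m), which comes to roughly 0.8% for m = 16384 and so lines up with the figure quoted above.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with m = 2**p registers (illustrative only)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)  # fixed size, one byte per register

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)               # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the registers

print(round(hll.estimate()), len(hll.registers))
```

The structure never grows past its initial size, and adding a value twice has no effect, which is what lets a distinct count live in a fixed-size row.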

Furthermore, PipelineDB has a special combine [1] aggregate that allows you to combine data structures such as HLL across multiple rows with no loss of information. A simpler example would be average: to get the actual average of multiple averages you obviously can't simply take the average of all the averages. Their weights must be taken into account, and combine handles that.

The capability to combine aggregate values in this way generalizes to all aggregates in PipelineDB.
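
The average example above can be sketched in a few lines of Python. This only illustrates the general idea of keeping mergeable partial state (a sum and a count) rather than the finished average. `AvgState` and its methods are hypothetical names for this example, and the sketch says nothing about how PipelineDB stores aggregate state internally.

```python
from dataclasses import dataclass


@dataclass
class AvgState:
    """Partial state for an average: enough to merge without losing information."""

    total: float = 0.0
    count: int = 0

    def add(self, x):
        self.total += x
        self.count += 1

    def combine(self, other):
        # Merging sums and counts respects each side's weight.
        return AvgState(self.total + other.total, self.count + other.count)

    def value(self):
        return self.total / self.count


a = AvgState()
for x in [10, 20]:
    a.add(x)

b = AvgState()
for x in [30, 40, 50, 60]:
    b.add(x)

naive = (a.value() + b.value()) / 2  # average of averages, ignores weights
combined = a.combine(b).value()      # weighted correctly via sum and count

print(naive, combined)
```

The naive average of averages gives 30.0, while combining the partial states gives 35.0, which matches the average of all six values taken together.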

[0] http://docs.pipelinedb.com/aggregates.html#hyperloglog-aggre...

[1] http://docs.pipelinedb.com/aggregates.html#combine

I'm Derek, one of the co-founders--great questions!

> So you create a table, insert into it, and it's always empty. Is that right?

That is correct. Streams can only be read by continuous queries (e.g. you can't even run a SELECT on them).

> Does this work for any table in pg? How does pg know that the insert should NOT actually insert a row?

PipelineDB streams are represented as a specific kind of PostgreSQL foreign table [0], so only foreign tables created in a specific way will be considered streams. You can, however, use triggers to write table rows and updates out to streams if you want to.

[0] https://www.postgresql.org/docs/current/static/sql-createfor...