Hacker News new | ask | show | jobs
by cevian 1695 days ago
(NB: post author here)

Great question. We support average of averages by storing the intermediate state of the aggregate (for average that's the sum and count) so we could cleanly re-aggregate.

Eventually, we'll be able to incrementally update the aggregate if we backfill even if the raw data is no longer available. That's not implemented yet though, so backfill only updates the aggregate if the raw data is still around by re-computing the intermediate state of the aggregate off the raw data for affected buckets. For most cases that isn't actually an issue since most people have a longer data retention period than backfill horizon.

1 comments

Thanks for the answer! I'd love to know more :) Also, I'm not following, how you guys deal with issues with unique counts? For example, lets say you've got 100 unique visitors on Monday and 100 on Tuesday. The unique visitors for both days might be anywhere between 100-200 and averaging counts between days doesn't work.
Not sure about this specific implementation but normally you handle this with approximations that support merging. i.e HyperLogLog You can merge 2 HyperLogLog counters to maintain proper distinct counts.
Yep!

And in fact, that's exactly what TimescaleDB supports - things like hyperloglog to support approximate count distinct, including as part of continuous aggregates. [0]

This blog post - "How PostgreSQL aggregation works and how it inspired our hyperfunctions’ design" - provides a really nice description of how our the API design of some of our analytical functions are motivated by the ability to "split" processing into the "pre-aggregation" and "finalization" steps, with the blog post focusing on the example of percentile approximation. (I think it was on HN a while back as well.) [1]

[0] https://blog.timescale.com/blog/introducing-hyperfunctions-n...

[1] https://blog.timescale.com/blog/how-postgresql-aggregation-w...

Awesome, thank you!