Hacker News
by skorgu 1704 days ago
I'm curious how this can both avoid the average-of-averages problem (presumably by using the original full-rate data to compute multiple aggregates) and also support backfilling. Is there a danger of the full-rate data expiring and backfills behaving differently past that horizon? Or am I wholly misunderstanding both of these features?
2 comments

(NB: post author here)

Great question. We avoid the average-of-averages problem by storing the intermediate state of the aggregate (for an average, that's the sum and count) so we can cleanly re-aggregate.

Eventually, we'll be able to incrementally update the aggregate on backfill even if the raw data is no longer available. That's not implemented yet, though, so for now a backfill only updates the aggregate if the raw data is still around: we re-compute the intermediate state of the aggregate from the raw data for the affected buckets. In most cases that isn't actually an issue, since most people have a longer data retention period than backfill horizon.
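
To make the backfill behavior concrete, here is a toy sketch in Python. It is not TimescaleDB code: the hourly bucket width, the `(timestamp, value)` row shape, and the names `bucket_start` and `backfill` are all assumptions made for illustration, and it assumes the raw rows for every affected bucket are still retained.

```python
BUCKET_SECONDS = 3600  # assumed bucket width (one hour)


def bucket_start(ts):
    """Map a timestamp (seconds) to the start of its bucket."""
    return ts - ts % BUCKET_SECONDS


def backfill(raw_rows, aggregate, late_rows):
    """Insert late rows, then rebuild (sum, count) for the affected buckets only.

    raw_rows:  list of (timestamp, value) still retained
    aggregate: dict mapping bucket start -> (sum, count)
    """
    raw_rows.extend(late_rows)
    affected = {bucket_start(ts) for ts, _ in late_rows}
    for bucket in affected:
        # Recompute the intermediate state from raw data, not from the old state.
        values = [v for ts, v in raw_rows if bucket_start(ts) == bucket]
        aggregate[bucket] = (sum(values), len(values))
    return affected


raw = [(0, 10.0), (1800, 20.0), (3600, 30.0)]
agg = {0: (30.0, 2), 3600: (30.0, 1)}
touched = backfill(raw, agg, [(600, 60.0)])  # a late row for the first hour
```

Only the first bucket is rebuilt here; the second keeps its stored state. If the raw rows for a bucket had already expired, this recomputation would have nothing to work from, which is the horizon the question above is asking about.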

Thanks for the answer! I'd love to know more :) Also, I'm not following how you guys deal with unique counts. For example, let's say you've got 100 unique visitors on Monday and 100 on Tuesday. The unique visitors across both days might be anywhere between 100 and 200, and averaging counts between days doesn't work.
Not sure about this specific implementation, but normally you handle this with approximations that support merging, e.g. HyperLogLog. You can merge two HyperLogLog counters to maintain proper distinct counts.
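
As a rough illustration of why merging works, here is a minimal HyperLogLog-style sketch in Python. It is a toy, not the implementation any particular database uses: the class name `TinyHLL`, the choice of SHA-1 as the hash, and the precision `p=10` are all assumptions for the example. The key property is that merging is an element-wise max over registers, so two daily sketches combine into exactly the sketch you would have built from both days' raw visitors.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)               # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Return a new sketch representing the union of both inputs."""
        assert self.p == other.p, "sketches must share the same precision"
        out = TinyHLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # standard constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


monday, tuesday, both_days = TinyHLL(), TinyHLL(), TinyHLL()
for i in range(100):          # 100 unique visitors on Monday
    monday.add(f"user-{i}")
    both_days.add(f"user-{i}")
for i in range(50, 150):      # 100 on Tuesday, 50 of them returning
    tuesday.add(f"user-{i}")
    both_days.add(f"user-{i}")

merged = monday.merge(tuesday)
```

Here `merged.registers` is identical to `both_days.registers`, and `merged.estimate()` comes out close to the true 150 distinct visitors rather than the 200 you would get by adding the daily counts.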
Yep!

And in fact, that's exactly what TimescaleDB supports - things like hyperloglog to support approximate count distinct, including as part of continuous aggregates. [0]

This blog post - "How PostgreSQL aggregation works and how it inspired our hyperfunctions’ design" - provides a really nice description of how the API design of some of our analytical functions is motivated by the ability to "split" processing into "pre-aggregation" and "finalization" steps, with the blog post focusing on the example of percentile approximation. (I think it was on HN a while back as well.) [1]

[0] https://blog.timescale.com/blog/introducing-hyperfunctions-n...

[1] https://blog.timescale.com/blog/how-postgresql-aggregation-w...
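
The "pre-aggregation" and "finalization" split can be sketched roughly as follows in Python. To keep it short, this uses a plain fixed-width histogram as the mergeable state, which is a stand-in chosen for the example and not the percentile sketch the blog post describes; `BUCKET_WIDTH`, `pre_aggregate`, `combine`, and `finalize` are hypothetical names.

```python
from collections import Counter

BUCKET_WIDTH = 10  # assumed histogram resolution (e.g. milliseconds)


def pre_aggregate(values):
    """Pre-aggregation: build a mergeable state (count per histogram bucket)."""
    return Counter(int(v // BUCKET_WIDTH) for v in values)


def combine(a, b):
    """Merge two states; the order of merging does not matter."""
    return a + b


def finalize(state, q):
    """Finalization: upper edge of the bucket that contains quantile q."""
    total = sum(state.values())
    if total == 0:
        raise ValueError("empty state")
    target = q * total
    seen = 0
    for bucket in sorted(state):
        seen += state[bucket]
        if seen >= target:
            return (bucket + 1) * BUCKET_WIDTH


hour1 = [5, 12, 18, 25]
hour2 = [31, 47, 95, 120]

# Store one state per hour, then roll up and finalize only at query time.
rolled_up = combine(pre_aggregate(hour1), pre_aggregate(hour2))
median_estimate = finalize(rolled_up, 0.5)  # 30
```

The rolled-up state equals the state built from all eight values at once, so the stored per-hour states can be re-aggregated to any coarser bucket before the final percentile is computed.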

Awesome, thank you!
You avoid average-of-averages by storing multiple summaries. For example, you don't compute and store average, you compute and store sum and count.
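
A small worked example in Python shows the difference. The function names `partial`, `combine`, and `finalize` are made up for the sketch and do not correspond to any library's API.

```python
def partial(values):
    """Intermediate state for an average: (sum, count)."""
    return (sum(values), len(values))


def combine(a, b):
    """Merge two intermediate states."""
    return (a[0] + b[0], a[1] + b[1])


def finalize(state):
    """Turn the intermediate state into the final average."""
    total, count = state
    return total / count


monday = [10, 20, 30]   # daily average 20
tuesday = [100]         # daily average 100

# Averaging the two daily averages weights both days equally.
naive = (finalize(partial(monday)) + finalize(partial(tuesday))) / 2   # 60.0

# Combining the stored (sum, count) states gives the true average.
correct = finalize(combine(partial(monday), partial(tuesday)))         # 40.0
```

The naive result is off because Tuesday's single reading counts as much as Monday's three; the combined state weights each reading equally.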