by jplevine 2060 days ago
Hi, I'm the product manager for Cloudflare Analytics. Thanks for this thorough and thoughtful review.

We are totally serious about building a world-class, privacy-first, free analytics product. At risk of HN cliche, this is our "early work". We are actively working to fix many of the rough edges mentioned here; if we had waited to fix all of them before shipping, we never would have shipped!

For folks who haven't seen it, I suggest checking out our launch blog post[0] which gives some more context around edge vs browser analytics (spoiler: we do both!), why we count visits the way we do, and how we handle bot traffic.

We know we have work to do on the "jagged lines" problem. For some low-traffic websites, we might show noisier, low-resolution data than is ideal. (We've artificially constrained our analytics to query a maximum of 7 days at a time because this problem is exacerbated with longer time ranges.)

My colleague Jamie wrote a nice blog post about how and why we sample data [1]. In short: we have an existing customer base of 25 million+ Internet properties, whose traffic volume spans 9 orders of magnitude! Sampling data is an elegant approach that allows us to serve fast, flexible analytics for all our customers. Sampling shouldn't be feared, but we know we can do better in some cases. We've recently merged some deep-in-the-weeds improvements to ClickHouse [2] that should result in improved resolution. And we're currently working to store full-resolution data for the smallest websites.

Happy to address any other specific points that folks have questions about.

[0] https://blog.cloudflare.com/free-privacy-first-analytics-for...
[1] https://blog.cloudflare.com/explaining-cloudflares-abr-analy...
[2] https://github.com/ClickHouse/ClickHouse/pull/14221

2 comments

> Sampling shouldn't be feared

Well, I would say the opposite: sampling should absolutely be feared. In a lot of cases sampling is not an issue (the home page, or a popular page), but in others, including checkout pages, product pages, and low-visibility pages, sampling can make a massive difference. When working with sampled data, you should always keep that in mind.
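
To put rough numbers on that, here is a minimal sketch. It assumes plain uniform random sampling at one fixed rate, which is an assumption for illustration and not necessarily how Cloudflare samples; the function name, page names, view counts, and the 1% rate are all made up. If each view is kept with probability p and the kept count is scaled back up by 1/p, the relative standard error of the estimate is sqrt((1 - p) / (N * p)) for a page with N real views.

```python
import math


def relative_std_error(true_count: int, sample_rate: float) -> float:
    """Relative standard error of the scaled-up estimate k / sample_rate,
    where k ~ Binomial(true_count, sample_rate).

    Hypothetical helper: assumes uniform sampling at a single fixed rate.
    """
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return math.sqrt((1 - sample_rate) / (true_count * sample_rate))


# Made-up traffic figures, both sampled at an assumed 1% rate.
for name, views in [("home page", 1_000_000), ("checkout page", 200)]:
    error = relative_std_error(views, 0.01)
    print(f"{name}: {views} views, relative error ~{error:.1%}")
```

Under those assumptions the home page estimate is off by about 1%, while the checkout page estimate is off by about 70%, which is the kind of gap I mean.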

I think his point is that those pages likely shouldn't be sampled (if they're truly low visibility and therefore don't have much traffic).
But would that happen on a per-page level or a per-site level? I think probably the latter, in which case the data is going to be a lot less useful where arguably it's most important.

> In short: we have an existing customer base of 25 million+ Internet properties, whose traffic volume spans 9 orders of magnitude! Sampling data is an elegant approach that allows us to serve fast, flexible analytics for all our customers.

I've been reflecting recently on how problems like this only exist for companies with extreme scale (similar to how microservices came about to solve FAANG-sized problems). This is a non-issue if you go with a product like Plausible (or my personal choice: GoatCounter) for your analytics, because in that case you're essentially just paying them to manage an instance of their open source software for you on a multi-tenant server (I'm guessing here). And if it does eventually become a scale problem for Plausible, to the point where they start complicating their architecture to solve it, you can self-host or switch to another Plausible provider.

If you set out to solve a simpler problem, you can use a simpler solution.