by site-packages1 2143 days ago
What do you all do with the collected metrics over time? Do you store everything forever, drop everything after a couple of weeks, or something in between? I've heard of people thinning out old data a bit (?) and storing that long term rather than storing everything. What's the usual thing people do?
4 comments

High fidelity operational metrics have a useful half life measured in days or weeks. Read patterns for longer term use cases are also categorically different. Best architecture is to have a separate system for long term stuff, which treats Prometheus as a data source. Then Prometheus can drop after 14-28d.
> High fidelity operational metrics have a useful half life measured in days or weeks.

It depends on how high fidelity you're talking, but in my experience retaining these metrics can be valuable, not only for viewing seasonal trends (already mentioned in another reply) but also for debugging problems. It can be helpful to view prior events and compare metrics at those times to a current scenario, for example as part of a postmortem analysis. I do agree that the usefulness of old metrics falls off with time. Metrics issued from a system 3 years ago likely have little in common with the system running today.

> High fidelity operational metrics have a useful half life measured in days or weeks.

Depends on the metric IMO. There's a ton of use you can get out of forecasting and seasonality for anomaly detection, but you need data going back for that to have any chance. Many relevant operations metrics exhibit three levels of seasonality: daily (day/night), weekly (weekday/weekend), and annual (holidays, Super Bowls, media events). Being able to forecast inbound network traffic on a switch to find problems would effectively require you to have 1y of data. You _might_ be able to discard some of the data, but you'd lose some of the predictive capacity for, say, the Super Bowl.
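
To make the weekly case concrete, here's a rough sketch of a seasonal baseline check: compare the current value to the same time of week in previous weeks. Everything in it (the is_anomalous name, the 4-week window, the 5-minute tolerance, the 3-sigma cutoff) is my own assumption for illustration, not something from a particular tool. It only works if you still have those earlier weeks of data.

```python
from statistics import mean, stdev

WEEK_S = 7 * 86400  # seconds in one week

def is_anomalous(history, ts, value, weeks=4, tolerance_s=300, k=3.0):
    """Seasonal-naive check against the same time of week in prior weeks.

    history: list of (timestamp_s, value) samples (hypothetical format).
    Returns True/False, or None when fewer than two prior weeks are available.
    """
    peers = []
    for w in range(1, weeks + 1):
        target = ts - w * WEEK_S
        # Samples within tolerance_s of the same moment w weeks ago.
        near = [v for t, v in history if abs(t - target) <= tolerance_s]
        if near:
            peers.append(mean(near))
    if len(peers) < 2:
        return None  # not enough retained history to form a baseline
    mu, sigma = mean(peers), stdev(peers)
    # Guard against a zero spread so identical peers don't divide the world in two.
    return abs(value - mu) > k * max(sigma, 1e-9)
```

With only 14 days of retention this has at most two peers to work with, and the annual version of the same idea (swap WEEK_S for a year) has none at all.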

I agree that it's important to keep some telemetry data for the long term. My point is that you need fewer and less granular metrics for those use cases, and that the access patterns are sufficiently different from real-time operations, that they're most effectively served by two completely different systems.
7-day retention in Prometheus, pushing to something like VictoriaMetrics for downsampling and long-term storage. Prometheus is great for collection but rubbish for managing large data sets.
One thing you can do is configure compression, so that less recent data has lower time resolution and/or lower cardinality. Some dimensions are dropped, and you only have e.g. 1h resolution for data older than some threshold.
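
Roughly what that looks like, as a toy sketch in plain Python rather than any real TSDB's implementation. The sample format, the downsample name, the 14-day threshold and the choice of dropping the "instance" label are all assumptions made up for the example. Averaging is only sensible for gauge-like metrics; counters would need a different aggregate.

```python
from collections import defaultdict

def downsample(samples, now, threshold_s=14 * 86400, bucket_s=3600,
               drop_labels=("instance",)):
    """Thin out old samples: coarser resolution and fewer label dimensions.

    samples: iterable of (timestamp_s, labels_dict, value), label values as strings.
    Samples newer than threshold_s are kept untouched.
    """
    recent = []
    buckets = defaultdict(list)
    for ts, labels, value in samples:
        if now - ts < threshold_s:
            recent.append((ts, dict(labels), value))
            continue
        # Drop the high-cardinality labels, keep the rest as a hashable key.
        kept = tuple(sorted((k, v) for k, v in labels.items()
                            if k not in drop_labels))
        bucket_start = ts - ts % bucket_s  # align to the start of the hour
        buckets[(bucket_start, kept)].append(value)
    # One averaged point per (hour, remaining labels) combination.
    old = [(start, dict(kept), sum(vals) / len(vals))
           for (start, kept), vals in sorted(buckets.items())]
    return old + recent
```

So two old samples from different instances in the same hour collapse into one point for the job, while anything inside the threshold stays at full fidelity.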
Depends on your needs really. Some metrics we do (for now) keep indefinitely. We're using Thanos to ship data to a bucket in object storage. Some metrics we keep for two weeks only.
Then that's business data not monitoring data. Whole different use case and tools. One loses its value over time, the other doesn't.