by sagichmal 2143 days ago
High fidelity operational metrics have a useful half life measured in days or weeks. Read patterns for longer-term use cases are also categorically different. The best architecture is a separate system for long-term data that treats Prometheus as a data source. Prometheus can then drop data after 14-28 days.
2 comments

> High fidelity operational metrics have a useful half life measured in days or weeks.

It depends on how high fidelity you're talking, but in my experience retaining these metrics can be valuable, not only for viewing the seasonal trends already mentioned in another reply but also for debugging problems. It can be helpful to view prior events and compare the metrics from those times to a current scenario, for example as part of a postmortem analysis. I do agree that the usefulness of old metrics falls off with time. Metrics emitted by a system 3 years ago likely have little in common with the system running today.

> High fidelity operational metrics have a useful half life measured in days or weeks.

Depends on the metric IMO. There's a ton of use you can get out of forecasting and seasonality for anomaly detection, but you need data going back far enough for that to have any chance. Many relevant operations metrics exhibit three levels of seasonality: daily (day/night), weekly (weekday/weekend), and annual (holidays, Super Bowls, media events). Being able to forecast inbound network traffic on a switch to find problems would effectively require you to have 1y of data. You _might_ be able to discard some of the data, but you'd lose some of the predictive capacity for, say, the Super Bowl.
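
To make the seasonality point concrete, here is a minimal sketch of a seasonal baseline check, assuming one sample per day and a weekly period. The function names, the sample data, and the 50% tolerance are all hypothetical, and this is a toy comparison against the same phase of earlier periods, not a real forecasting model.

```python
from statistics import mean

def seasonal_baseline(history, period):
    """Mean of past values at the same phase of the period as the next sample.

    history: values at fixed intervals, oldest first.
    period:  number of samples per seasonal cycle (hypothetical: 7 for daily data).
    """
    next_index = len(history)
    same_phase = history[next_index % period::period]
    if not same_phase:
        raise ValueError("need at least one full period of history")
    return mean(same_phase)

def is_anomalous(history, current, period, tolerance=0.5):
    """Flag `current` if it is further than `tolerance` (as a fraction) from the baseline."""
    baseline = seasonal_baseline(history, period)
    return abs(current - baseline) > tolerance * abs(baseline)

# Hypothetical daily request counts: busy weekdays, quiet weekends.
week = [100, 100, 100, 100, 100, 40, 40]
history = week * 4 + week[:5]  # four full weeks plus Monday-Friday; next sample is a Saturday

print(is_anomalous(history, 45, period=7))   # close to a typical Saturday
print(is_anomalous(history, 100, period=7))  # normal for a weekday, unusual for a Saturday
```

The same value can be normal or anomalous depending on where it falls in the cycle, which is why the baseline needs history covering at least one full period, and a full year for annual events.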

I agree that it's important to keep some telemetry data for the long term. My point is that you need fewer and less granular metrics for those use cases, and that the access patterns are sufficiently different from real-time operations that the two are most effectively served by completely different systems.
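
As a rough illustration of "less granular", here is a sketch of the kind of downsampling a long-term system might apply to samples pulled from the short-term store. It assumes samples are `(unix_timestamp, value)` pairs and that averaging is an acceptable aggregation. The `downsample` helper and the sample data are hypothetical, not part of Prometheus.

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # align to the start of the bucket
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Hypothetical 15-second samples covering two minutes.
raw = [(t, float(t // 15)) for t in range(0, 120, 15)]

# Eight raw points collapse into two one-minute averages.
per_minute = downsample(raw, 60)
print(per_minute)  # [(0, 1.5), (60, 5.5)]
```

Averaging hides spikes, so a real pipeline would probably keep min/max or percentiles per bucket as well; which aggregates to keep depends on the long-term questions you expect to ask.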