by TylerE 796 days ago
Strongly disagree. Having stored telemetry has helped me debug so many things.

Forever is probably too much, but keeping a month or so is totally sane.


What kind of things did you debug with CPU/Memory/Storage telemetry that you couldn't have debugged by only turning those things on after you knew there was a problem?
Identifying patterns where problems coincide with other processes or times, eventually tracking it down to a release done by another team.

It's happened to me a few times.

So your business metrics suddenly dropped, but what has changed?

This service is using 80% CPU, that seems a bit high... but is it always this high? Looks like it spiked within the last hour. But wait, it does that every Monday at 9 am, so probably a red herring.
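A rough sketch of the "is it always this high?" check, assuming you have kept earlier readings for the same weekly slot (e.g. previous Mondays at 9 am). The function name, the `factor` of 1.5 and the use of a median baseline are all illustrative assumptions, not any particular tool's behaviour:

```python
from statistics import median

def is_unusual(current, same_slot_history, factor=1.5):
    """Compare a reading against stored readings from the same weekly slot.

    Returns None when there is no history, because without stored
    telemetry there is no baseline to compare against.
    `factor` is a hypothetical tolerance, not a recommended value.
    """
    if not same_slot_history:
        return None
    baseline = median(same_slot_history)
    return current > factor * baseline

# 80% CPU looks alarming on its own, but not next to previous Mondays.
print(is_unusual(80, [78, 82, 79]))
# The same 80% is a real change if earlier Mondays sat around 30%.
print(is_unusual(80, [30, 35, 32]))
```

With no stored history the function can only return None, which is the situation you are in if telemetry is switched on only after the problem shows up.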

This cache has a hit ratio of 60%... is that good? A bit low? Actually it's suspiciously high compared to last week - looks like a lot of people aren't getting a personalised feed.

Metrics are incredibly cheap to keep around for the value you get from a good operational dashboard, despite what Datadog/Amazon/Grafana Cloud tell you. It's the most egregiously overpriced data you can buy since 20 cent text messages.

A good start is to set up VictoriaMetrics with some collectors and set retention to 14 days.

When storage is full and you don't know about it, you can't release anything to enable the logs in the first place.
You can poll storage periodically though, you don't need to keep a constant metrics stream of where it's at. Also you can set up each machine to alert when its own storage fills up.

Also, as your storage hits 97%+, you'll probably start seeing effects in your business metrics, and then you can look into it.
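A minimal sketch of that kind of poll-and-alert check, using only `shutil.disk_usage` from the Python standard library. The function names and the 90% threshold are assumptions for illustration, and nothing is stored between polls:

```python
import shutil

def storage_fraction_used(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def check_storage(path="/", threshold=0.90, fraction_used=None):
    """Return an alert message if usage is at or above `threshold`, else None.

    `threshold` is a hypothetical value; `fraction_used` can be passed in
    directly so the check can be exercised without touching a real disk.
    """
    if fraction_used is None:
        fraction_used = storage_fraction_used(path)
    if fraction_used >= threshold:
        return f"ALERT: {path} is {fraction_used:.0%} full"
    return None

# Run from cron or a timer on each machine; only act when something is returned.
message = check_storage(".")
if message:
    print(message)
```

How the alert gets delivered (email, pager, chat) is left out, since that depends entirely on your setup.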

I think you are confusing real-time metrics, streamed with very high precision (below 1s), with metrics that are simply polled at a fixed interval (most use cases).

Real-time, high-precision metrics aren't necessary. But when you say that you don't need metrics and then say that you can poll metrics periodically, you are contradicting yourself.

I'm not contradicting myself. I'm saying you just poll for storage, you don't store the results. My entire thesis is that those metrics aren't worth storing.
Then you're crossing your fingers that the process polling the storage never crashes. If it does, you're left in the dark: with no metric stored, you'll never know when things are about to go down the drain.
> You can poll storage periodically though, you don't need to keep a constant metrics stream of where it's at. Also you can set up each machine to alert when its own storage fills up.

Unless you want to be able to have trends over time, either for capacity planning (needing to order more storage in case of bare metal, or planning costs ahead) or to correlate with other things (storage consumption is growing twice as fast since deployment X, did we change something there?).
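To make the capacity-planning point concrete, here is a rough sketch that fits a straight line through stored (day, bytes used) samples and estimates how long until the disk is full. The function name and the sample data are made up for illustration, and a plain least-squares line is an assumption; real growth is often not linear:

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until usage reaches `capacity`.

    `samples` is a list of (day, used) pairs taken from stored metrics.
    Returns None if there are too few samples or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    last_day = samples[-1][0]
    return max(0.0, full_day - last_day)

# Hypothetical history: usage grows by 10 units a day towards a capacity of 200.
history = [(0, 100), (1, 110), (2, 120), (3, 130)]
print(days_until_full(history, capacity=200))
```

Running the same fit on the samples before and after a deployment gives two slopes to compare, which is the "growing twice as fast since deployment X" question. None of this is possible if the polled values are thrown away.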

You don't need to have 1s granularity metrics on storage consumption, but having none is just stupid levels of fake "optimisation" that will cost you more in the long run.