by jedberg 795 days ago
NO! You don't!

I couldn't agree with the author more. Keeping historical records of business metrics makes a ton of sense. But historical telemetry (CPU, Memory, Network, error logs) makes little sense.

If an issue occurs, then turn on telemetry around that issue until you track it down. If an issue occurs once and never again, did it really matter? This obviously does not apply to security, I'm just speaking of operational issues.

Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.


I feel like this swings the pendulum a little too far to the other side. There's very little harm in having telemetry on at all times, but log rotate once a week/month/whatever works for you. If you have telemetry off to begin with, you might not even notice you have an issue while your users do.
You should have a ton of telemetry on business metrics. You would absolutely notice an issue before your users if you have those. For example at Netflix we monitored stream starts per second -- how often you hit play and it worked. That metric was the most important, and the one that triggered most investigations.

If your CPU and memory aren't affecting the business metrics, then it's not super relevant.
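
A minimal sketch of alerting on a business metric like the one described above, assuming you get one rate sample per interval (for example, stream starts per second). The class name, window size, and drop threshold are all hypothetical choices for illustration, not how Netflix actually did it:

```python
from collections import deque


class RateDropDetector:
    """Flag a sample that falls well below the recent average rate."""

    def __init__(self, window=60, max_drop=0.5):
        # Only the last `window` samples are kept, in memory.
        self.samples = deque(maxlen=window)
        # Fraction of the baseline the rate may lose before we alert.
        self.max_drop = max_drop

    def observe(self, rate):
        """Record a sample; return True if it dropped too far below baseline."""
        alert = False
        # Wait until the window is full so the baseline is meaningful.
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            alert = rate < baseline * (1 - self.max_drop)
        self.samples.append(rate)
        return alert
```

The design choice here is that the alert fires on the metric users actually feel, and only then would you go looking at CPU or memory.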

As someone who's been on a maintenance team for years, keeping monitoring (cpu, memory, disk, etc) for at least two weeks is critical, and I'd prefer 6 months to easily identify larger trends and prevent issues before they happen.
Very little harm... If the telemetry is from your users I'd like you to value them more than that.

Also consider the potential risks of handling personal data and leaks.

This only holds if you assume telemetry means personal data, but that is a very big if. Meta, Google and other giants generally deal in telemetry that includes personal data, but for most run-of-the-mill software that's not the case. Outside of advertising, I would argue that for most applications you're already pretty close to being clear of personal data as long as you exclude the user's email and other identifiers from the logs. Sure, there are examples where this is not the case, but it isn't even remotely as big of a problem as you claim it to be.
A lot of telemetry can become personal data. Filenames etc. are the easy parts.

Telemetry needs to be motivated for it to not be considered spyware. You need to really consider what you are logging and why, and then, is it worth the downsides.

It is not something to take lightly, hardly "no harm".

Very few things are worth keeping after two weeks. I like short retention policies.
> Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.

A day is a pretty small window, I'd say a week or a bit more is good enough for most orgs. That way you can compare specific endpoints/code between deploys, answering questions like "was this endpoint this slow last week too or did I break it?". Some issues take a few days to brew and having historical data is important in debugging. Many orgs don't do load testing at all or have any real performance analysis done before things crash.

Log retention is also directly tied to how quickly and easily you can detect and recover from issues.

> Log retention is also directly tied to how quickly and easily you can detect and recover from issues.

I disagree. Every issue I've ever debugged, I did a tail -f on the logs. I can't recall ever searching the old logs.

Even if it takes a few days for an issue to brew, usually the logs right now will show the issue. Or if they don't, then you can turn on the logs and have them in a few days' time. It's so rare that it's almost never worth keeping the logs around just for that one case where an old log might lead to resolution, and rarely does one have time during an active incident to look at old logs anyway.

> I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.

User writes into support 3 days after the problem occurred, and support goes back and forth covering level 1 possibilities for an additional 2 days before escalating. It's common for 1 support complaint to represent some larger factor of users who never complain, so it would be useful to understand how common the issue is once it has been identified in the observability data. Having one day isn't sufficient in this scenario.

I think you missed my key point -- I'm talking about operational metrics not business metrics. With business metrics you can get historical context, but I don't see how CPU/Memory/Storage/App logs will help you.
Here is a good piece on gaining value from long term operational metrics. https://danluu.com/metrics-analytics/
Yep, I missed that. I definitely have hit cases where having a larger window of infrastructure metrics has been very useful. Being able to correlate it against other observability factors can help to understand what caused a problem. But I agree that you don't have to keep it forever. I think a few weeks is fine, assuming the scale of the system doesn't mean that a few weeks is an unwieldy amount of data.
if you don't have metrics for cpu/memory/storage how do you know when to scale the app, or when you are at limit of the storage? i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.

collecting user-identified telemetry is debatable (depends on the case), but not collecting anything at all is just plain stupid.

> if you don't have metrics for cpu/memory/storage how do you know when to scale the app

When the business metrics start to fail. You don't need constant metrics on storage, you can poll it every so often. If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.

I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.

> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

After having annoyed how many users and lost how much revenue? Having metrics to identify brewing problems before issues start to arise (be they approaching CPU, memory, disk, or network limits, or rising network latency that will soon show up in the business metrics but hasn't yet) is valuable.

> I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.

I have a hard time believing that at either of those it was acceptable to have a problem ongoing for days without any idea what's happening because logs and metrics weren't enabled in the first place.

> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

…but why incur that round trip on my feedback loop? Having those metrics on doesn’t cost me much.

This feels potentially like the perspective of a large organisation with both mature monitoring systems and quite steady state user base activity (through scale). When I have a customer who had an issue yesterday because they had an unusual workload that won’t be repeated often, I can’t afford not to have had the basic metrics turned on, in case they point us in the right direction.

where you worked doesn't matter to me very much, when what you are saying contradicts what you probably did ("experience in large scale systems"), also it sounds like an argument from authority.

not having cpu/mem/hdd metrics is just plain bogus and sounds like fantasy world, where everything works like we expect it to work, and there are no bugs at all. ridiculous

You question his competence.

> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.

He was answering that.

If instead of dismissing someone outright and questioning their competence, you had raised specific concerns, this would have been a more productive conversation.

> i feel like you have never touched servers/backend in anything more than simple projects (or at all)

I feel like if you are going to go out on a limb and call someone's expertise into question...

> I ran all of ops for reddit for four years and headed up SRE at Netflix

And they provide excellent credentials which you failed to check...

> where you worked doesn't matter to me very much

You can't just weasel out of it by pretending like you didn't start the interaction by calling someone's expertise into question.

Strongly disagree. Having stored telemetry has helped me debug so many things.

Forever is probably too much, but keeping a month or so is totally sane.

What kind of things did you debug with CPU/Memory/Storage telemetry that you couldn't have debugged by only turning those things on after you knew there was a problem?
Identifying patterns where problems coincide with other processes or times, eventually tracking it down to a release done by another team.

It's happened to me a few times.

So your business metrics suddenly dropped, but what has changed?

This service is using 80% CPU, that seems a bit high... but is it always this high? Looks like it spiked within the last hour. But wait, it does that every Monday at 9 am, so probably a red herring.

This cache has a hit ratio of 60%... is that good? A bit low? Actually it's suspiciously high compared to last week - looks like a lot of people aren't getting a personalised feed.
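
The "is it always this high?" question above can be sketched as a week-over-week comparison. This is only an illustration, assuming stored samples live in a dict keyed by timestamp; the function name, the `history` layout, and the 20% tolerance are hypothetical:

```python
from datetime import datetime, timedelta


def is_unusual(history, now, value, tolerance=0.2):
    """Compare `value` with the sample taken exactly one week before `now`.

    Returns True/False, or None when no sample from a week ago is stored.
    """
    last_week = history.get(now - timedelta(weeks=1))
    if last_week is None:
        return None  # no stored history, so there is nothing to compare against
    if last_week == 0:
        return value != 0
    # Relative change versus the same moment last week.
    return abs(value - last_week) / abs(last_week) > tolerance


# A Monday at 09:00 with 80% CPU stored for the previous Monday at 09:00.
monday = datetime(2024, 1, 8, 9)
history = {monday - timedelta(weeks=1): 0.80}
print(is_unusual(history, monday, 0.82))  # same as last week's spike, so likely a red herring
```

Without the stored sample the function can only return None, which is the whole argument for keeping a couple of weeks of history.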

Metrics are incredibly cheap to keep around for the value you get from a good operational dashboard, despite what Datadog/Amazon/Grafana Cloud tell you. It's just the most egregiously overpriced data you can buy since 20-cent text messages.

A good start is to set up VictoriaMetrics with some collectors and set retention to 14 days.

when storage is full, and you don't know about that, you can't release anything to enable the logs in the first place.
You can poll storage periodically though, you don't need to keep a constant metrics stream of where it's at. Also you can set up each machine to alert when its own storage fills up.

Also, as your storage hits 97%+, you'll probably start seeing effects in your business metrics, and then you can look into it.
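
A minimal sketch of polling storage without storing the results, as described above. It assumes you run it from cron or a similar scheduler and wire the returned message into whatever alerting you already have; the function names and the 97% default threshold are hypothetical:

```python
import shutil


def is_nearly_full(used, total, threshold=0.97):
    """True when the used fraction is at or above the threshold."""
    return total > 0 and used / total >= threshold


def poll_storage(path="/", threshold=0.97):
    """Check disk usage once; return an alert message or None.

    Nothing is written anywhere: the reading is discarded after the check.
    """
    usage = shutil.disk_usage(path)
    if is_nearly_full(usage.used, usage.total, threshold):
        return f"ALERT: {path} is {usage.used / usage.total:.0%} full"
    return None
```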

I think that you are confusing real-time metrics, streamed with very high precision (below 1s), with metrics that are simply polled at some fixed interval (most use-cases).

real-time, high precision metrics aren't necessary. when you say that you don't need metrics and then say that you can poll metrics periodically, you are contradicting yourself.

I'm not contradicting myself. I'm saying you just poll for storage, you don't store the results. My entire thesis is that those metrics aren't worth storing.
> You can poll storage periodically though, you don't need to keep a constant metrics stream of where it's at. Also you can set up each machine to alert when its own storage fills up.

Unless you want to be able to have trends over time, either for capacity planning (needing to order more storage in case of bare metal, or planning costs ahead) or to correlate with other things (storage consumption is growing twice as fast since deployment X, did we change something there?).

You don't need to have 1s granularity metrics on storage consumption, but having none is just stupid levels of fake "optimisation" that will cost you more in the long run.
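
For the capacity-planning case, a rough sketch of what stored samples buy you: fit a straight line through daily usage readings and extrapolate to the day the disk fills. It assumes growth is roughly linear, which is an assumption and often isn't true; the function name and the `(day, used)` sample format are hypothetical:

```python
def days_until_full(samples, capacity):
    """Estimate days until `capacity` is reached from (day, used) samples.

    Uses an ordinary least-squares line. Returns None when there is too
    little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    var_x = sum((day - mean_x) ** 2 for day, _ in samples)
    if var_x == 0:
        return None  # all samples taken on the same day
    slope = sum((day - mean_x) * (used - mean_y) for day, used in samples) / var_x
    if slope <= 0:
        return None  # flat or shrinking usage never fills the disk
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    last_day = samples[-1][0]
    return max(0.0, full_day - last_day)
```

Even one sample per day is enough for this, which is far from 1s granularity but still requires keeping the readings around.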