by rezonant 795 days ago
> I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.

A user writes in to support 3 days after the problem occurred, and support goes back and forth covering level 1 possibilities for an additional 2 days before escalating. It's common for 1 support complaint to represent some larger number of users who never complain, so it would be useful to understand how common the issue is once it has been identified in the observability data. One day of history isn't sufficient in this scenario.

I think you missed my key point -- I'm talking about operational metrics not business metrics. With business metrics you can get historical context, but I don't see how CPU/Memory/Storage/App logs will help you.
Here is a good piece on gaining value from long term operational metrics. https://danluu.com/metrics-analytics/
Yep, I missed that. I definitely have hit cases where having a larger window of infrastructure metrics has been very useful. Being able to correlate it against other observability factors can help to understand what caused a problem. But I agree that you don't have to keep it forever. I think a few weeks is fine, assuming the scale of the system doesn't mean that a few weeks is an unwieldy amount of data.
if you don't have metrics for cpu/memory/storage how do you know when to scale the app, or when you are at the limit of the storage? i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.

collecting user-identified telemetry is debatable (depends on the case), but not collecting anything at all is just plain stupid.

> if you don't have metrics for cpu/memory/storage how do you know when to scale the app

When the business metrics start to fail. You don't need constant metrics on storage; you can poll it every so often. If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.
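To make "poll it every so often" concrete, here is a minimal sketch of an occasional storage check in Python. The function names, the 90% threshold, and the path are assumptions chosen for illustration, not anything described in this thread; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def is_over_threshold(used: int, total: int, threshold: float = 0.9) -> bool:
    """Return True when used/total meets or exceeds the threshold (assumed 90%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total >= threshold


def check_storage(path: str = "/", threshold: float = 0.9) -> bool:
    """Poll the filesystem holding `path` once and report whether it is nearly full."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return is_over_threshold(usage.used, usage.total, threshold)


if __name__ == "__main__":
    # Run from cron or a timer at whatever interval suits the system.
    print("storage nearly full:", check_storage("/"))
```

A check like this only gives a point-in-time answer; it does not retain any history on its own.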

> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.

I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.

> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

After having annoyed how many users and lost how much revenue? Having metrics to identify brewing problems before issues start to arise (be they oncoming CPU, memory, disk, or network constraints, or increasing network latency that will soon, but does not yet, show up in the business metrics) is valuable.
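As a rough illustration of spotting a brewing problem from retained metrics, the sketch below fits a straight line to past disk-usage samples and estimates how long until the disk fills. The function name, the sample format, and the linear-growth assumption are all hypothetical; real usage is rarely this tidy.

```python
from typing import List, Optional, Tuple


def hours_until_full(samples: List[Tuple[float, float]], capacity: float) -> Optional[float]:
    """Estimate hours from the last sample until `capacity` is reached.

    `samples` is a list of (hour, used_bytes) pairs. Returns None when there
    is too little data or usage is not growing. Assumes linear growth.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None  # all samples share one timestamp
    # Ordinary least-squares slope: bytes of growth per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None  # flat or shrinking usage, nothing to forecast
    _, last_used = samples[-1]
    return (capacity - last_used) / slope


if __name__ == "__main__":
    history = [(0, 100), (1, 110), (2, 120)]
    print(hours_until_full(history, capacity=200))
```

An estimate like this needs history to work from, which is the point being argued: without stored samples there is no trend to extrapolate.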

> I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.

I have a hard time believing that at either of those it was acceptable to have a problem ongoing for days without any idea what's happening because logs and metrics weren't enabled in the first place.

> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

…but why incur that round trip on my feedback loop? Having those metrics on doesn’t cost me much.

This feels potentially like the perspective of a large organisation with both mature monitoring systems and quite steady state user base activity (through scale). When I have a customer who had an issue yesterday because they had an unusual workload that won’t be repeated often, I can’t afford not to have had the basic metrics turned on, in case they point us in the right direction.

where you worked doesn't matter to me very much, when what you are saying contradicts what you probably did ("experience in large scale systems"). also, it sounds like an argument from authority.

not having cpu/mem/hdd metrics is just plain bogus and sounds like a fantasy world, where everything works like we expect it to work, and there are no bugs at all. ridiculous

You question his competence.

> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.

He was answering that.

If, instead of dismissing someone outright and questioning their competence, you had raised specific concerns, this would have been a more productive conversation.

> You question his competence.
> He was answering that.
> If, instead of dismissing someone outright and questioning their competence, you had raised specific concerns, this would have been a more productive conversation.

he first said that we don't need to monitor anything, just enable debugging when "business metrics" are failing, and then he changed his stance to "polling from time to time". that just shows that his first take wasn't thoughtful, so I assumed that he never worked in "the field" or worked only on smaller projects, as nobody who has worked on bigger projects would say that "we don't need CPU/mem/hdd metrics". it's not like he's proposing something novel; that's just a ridiculous take that needs to be called out

> i feel like you have never touched servers/backend in anything more than simple projects (or at all)

I feel like if you are going to go out on a limb and call someone's expertise into question...

> I ran all of ops for reddit for four years and headed up SRE at Netflix

And they provide excellent credentials which you failed to check...

> where you worked doesn't matter to me very much

You can't just weasel out of it by pretending like you didn't start the interaction by calling someone's expertise into question.

> And they provide excellent credentials which you failed to check...

that's a logical fallacy, you can work in any place on earth and still be wrong on the subject.

> You can't just weasel out of it by pretending like you didn't start the interaction by calling someone's expertise into question.

why? if his take is bad, then his job or experience doesn't change the outcome. i'm not an expert by any means, but the things that he's saying just contradict everything that is standard practice and my own experience. based on that i'm able to say that he doesn't know what he's saying/proposing, and using his "excellent credentials" just makes things worse, as it shows that he doesn't have an argument, just wishful thinking