by sofixa 794 days ago
> If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.

After having annoyed how many users and lost how much revenue? Having metrics to identify brewing problems before issues start to arise (be they impending CPU, memory, disk, or network constraints, or increasing network latency that will soon, but does not yet, show up in the business metrics) is valuable.

> I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.

I have a hard time believing that at either of those it was acceptable to have a problem ongoing for days without any idea what was happening because logs and metrics weren't enabled in the first place.