Hacker News new | ask | show | jobs
by kjeetgill 2528 days ago
I'd say you should always have CPU monitored, but I get that you might night care to aggressively alert on it. It can be invaluable for hunting down root-causes after the fact: nothing's perfect from the first deployment. I single bad host is best if it crashes, but is a lot more dangerous if it's just wonky.

Things like CPU hopefully shouldn't be your key/gold service-up metric, but paradoxically, the more mature your system the more CPU can tell you; you can catch problems before they happen. It can help notice things like bad CPUs.

Memory stays pretty important in my experience; even more than CPU.

And in addition to all the other responses there are also different levels of pages: Some are page me at 5am, some can wait till morning, and some can wait till Monday. FAANG is more likely to have their own hardware so you actually get deeper/more diverse monitoring needs than a shop on AWS or something.

Source: FAANG-ish tier infra work