	by learnfromstory 2527 days ago
	Don't really agree that this list could have come about through discussions with engineers at Google, Facebook, etc. The more computers you have the less important it becomes to monitor junk like CPU and memory utilization of individual machines. Host-level CPU usage alerting can't possibly be a "must-have" if there are extremely large distributed systems operating without it. If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.

7 comments

kevinsundar 2527 days ago

I work at a FAANG and host-level CPU is most definitely an alert we page on. Though a single host hitting 100% CPU isn't really a problem in and of itself (our SOP is just to replace the host), it's an important sign to watch for other hosts becoming unhealthy. It might be overkill, but hey, there's mission-critical stuff at hand.

For example: if you have a fleet of hosts handling jobs with retries, a bad job could end up being passed host to host killing each host / locking up each one as it gets passed along. And that could happen in minutes while replacing and deploying and bootstrapping a new host takes longer. So by the time your automated system detects, removes, and spins up a new host everything is on fire.
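
A rough sketch of that race, assuming a hypothetical fleet where one poison job takes down a host every `minutes_per_kill` minutes and a replacement needs `minutes_to_replace` minutes to bootstrap. The function name and all numbers are made up for illustration; this is not a model of any real scheduler.

```python
def simulate(fleet_size, minutes_per_kill, minutes_to_replace, horizon):
    """Return the healthy-host count for each minute from 0 to horizon."""
    healthy = fleet_size
    returning = []  # minutes at which replacement hosts come online
    timeline = []
    for minute in range(horizon + 1):
        # Replacements that finished bootstrapping rejoin the fleet.
        healthy += sum(1 for t in returning if t == minute)
        returning = [t for t in returning if t > minute]
        # The poison job is retried on another healthy host and takes it down.
        if minute > 0 and minute % minutes_per_kill == 0 and healthy > 0:
            healthy -= 1
            returning.append(minute + minutes_to_replace)
        timeline.append(healthy)
    return timeline


# Slow replacement: the job eats hosts faster than they come back.
print(simulate(fleet_size=5, minutes_per_kill=1, minutes_to_replace=15, horizon=10))
# Fast replacement: the fleet shrinks a little and then holds steady.
print(simulate(fleet_size=5, minutes_per_kill=1, minutes_to_replace=2, horizon=10))
```

With slow replacement the healthy count reaches zero before the first new host arrives, which is the "everything is on fire" case.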

learnfromstory 2527 days ago

Could you mention which FAANG so I can avoid applying for a job there? Large-scale software systems _must_ be designed to serve through local resource exhaustion. If you are paging on resource exhaustion of a single host you are just paying the interest on your technical debt by drawing down your SREs' quality of life.

I stand by my beef with this article. The statement that "I've talked with engineers at Google [and concluded that a thing Google wouldn't tolerate is a must-have]" doesn't make sense. What I get from this article is you can talk with engineers at Google without learning anything.

kevinsundar 2527 days ago

I'm not at liberty right now to name my employer, but our systems are definitely designed to serve through local resource exhaustion. But we aren't talking about cheap hosts here. We generally run high-compute-optimized or high-memory-optimized hosts depending on the use case, and if these generally powerful hosts hit 100% CPU or full memory utilization there's usually more going on than something random or simple, so it's important to have someone check it out.

packetslave 2527 days ago

A single host stuck at 100% CPU also has a nasty effect on your tail latency, in a system with wide fanout. If a request hits 100 backend systems, and 1 of them is slow, your 99th percentile latency is going to go in the toilet.
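
The arithmetic behind this can be sketched as follows, assuming each backend call is independently slow with some small probability (the 1% figure and the function name are illustrative assumptions). If every request touches all 100 backends and one host is permanently stuck, the situation is even worse: every request waits on it.

```python
def p_request_sees_slow_backend(fanout: int, p_slow: float) -> float:
    """Probability that at least one of `fanout` independent backend calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


# Assumed: each individual backend call is slow 1% of the time.
for fanout in (1, 10, 100):
    print(fanout, round(p_request_sees_slow_backend(fanout, 0.01), 3))
```

At a fanout of 100, roughly 63% of requests wait on at least one slow call, so a per-backend tail that looks rare becomes the common case for the caller.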

learnfromstory 2527 days ago

Which is a good reason to hedge and replicate but NOT a reason to alert on high CPU usage of single computers.
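
A minimal sketch of a hedged request, assuming the data is replicated so any replica can answer. `hedged`, `fake_backend`, and the timings are hypothetical names and values for illustration, and error handling is left out (an exception from the first finished call simply propagates).

```python
import asyncio


async def hedged(make_call, replicas, hedge_after):
    """Call the first replica; each time hedge_after seconds pass with no reply, also try the next."""
    if not replicas:
        raise ValueError("need at least one replica")
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.create_task(make_call(replica)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_after, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return next(iter(done)).result()
        # Every replica has been tried; wait for whichever answers first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        # Cancel the calls we no longer need (a no-op for finished tasks).
        for task in tasks:
            task.cancel()


async def fake_backend(replica):
    # Hypothetical backend: replica "a" is pegged at 100% CPU and answers slowly.
    await asyncio.sleep(1.0 if replica == "a" else 0.01)
    return f"reply from {replica}"


print(asyncio.run(hedged(fake_backend, ["a", "b"], hedge_after=0.05)))
```

The caller gets an answer from the healthy replica shortly after the hedge delay, at the cost of some duplicate work, and nobody needs to be paged for the slow host.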

packetslave 2527 days ago

You definitely want to TRACK CPU usage on individual hosts, but, yeah, I would alert on service latency instead. Symptom, not cause.
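
One way to picture "track the cause, alert on the symptom" is the sketch below. The thresholds, field names, and the `evaluate` function are assumptions made up for illustration, not any monitoring product's API.

```python
def evaluate(latencies_ms, cpu_by_host, slow_ms=500.0, max_slow_fraction=0.01, hot_cpu=0.95):
    """Page on slow requests (symptom); only record hot hosts (possible cause)."""
    slow = sum(1 for ms in latencies_ms if ms > slow_ms)
    slow_fraction = slow / len(latencies_ms) if latencies_ms else 0.0
    return {
        # Symptom: users are actually seeing slow requests, so wake someone up.
        "page": slow_fraction > max_slow_fraction,
        # Cause candidates: kept for dashboards and debugging, never paged on.
        "hot_hosts": sorted(h for h, cpu in cpu_by_host.items() if cpu >= hot_cpu),
    }


# One host is pegged, but latency is fine: record it, do not page.
print(evaluate([20.0] * 100, {"h1": 1.0, "h2": 0.3}))
# 5% of requests are slow: page, and the hot-host list says where to look first.
print(evaluate([20.0] * 95 + [900.0] * 5, {"h1": 1.0, "h2": 0.3}))
```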

jandrewrogers 2527 days ago

This very much depends on the kind of software system. If there is parallel orchestration going on, such as join operators in a scale-out database, the performance of a single machine in the cluster can impact the performance of the entire cluster. In fact, the software will often monitor this itself so that it knows when and where to automatically shed load.

stillworks 2526 days ago

Granted the article is a bit of a "Decaf-Soy-Latte", but in my experience, whatever can be monitored should be monitored.

Software deliveries/releases can often realistically be imperfect. (I don't have direct experience with canary releases, TBH.)

If anything goes wrong, any objective evidence that helps reconstruct the failure scenario is valuable.

Also... Murphy's Law.

>If you've designed software where the whole service can degrade based on the CPU consumption of a single machine.

Typically, if such software is indeed released, I think it will be several CPUs on several hosts.

kjeetgill 2527 days ago

I'd say you should always have CPU monitored, but I get that you might not care to aggressively alert on it. It can be invaluable for hunting down root causes after the fact: nothing's perfect from the first deployment. A single bad host is best if it crashes, but is a lot more dangerous if it's just wonky.

Things like CPU hopefully shouldn't be your key/gold service-up metric, but paradoxically, the more mature your system the more CPU can tell you; you can catch problems before they happen. It can help notice things like bad CPUs.

Memory stays pretty important in my experience; even more than CPU.

And in addition to all the other responses there are also different levels of pages: Some are page me at 5am, some can wait till morning, and some can wait till Monday. FAANG is more likely to have their own hardware so you actually get deeper/more diverse monitoring needs than a shop on AWS or something.
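
The three levels could be sketched roughly like this. The criteria (user-facing impact, lost redundancy) and all names are hypothetical; real routing rules depend on the service.

```python
from enum import Enum


class Urgency(Enum):
    PAGE_NOW = "page me at 5am"
    NEXT_MORNING = "can wait till morning"
    NEXT_BUSINESS_DAY = "can wait till Monday"


def classify(user_facing_impact: bool, redundancy_lost: bool) -> Urgency:
    """Assumed policy: impact pages now, lost headroom waits for morning, the rest is a ticket."""
    if user_facing_impact:
        return Urgency.PAGE_NOW
    if redundancy_lost:
        return Urgency.NEXT_MORNING
    return Urgency.NEXT_BUSINESS_DAY


# A single hot host with no user impact and spare capacity left is a ticket, not a page.
print(classify(user_facing_impact=False, redundancy_lost=False).value)
```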

Source: FAANG-ish tier infra work

yibg 2527 days ago

Outliers are where the interesting stuff happens, and outliers happen to individual instances. Aggregates are useful but can be very misleading. You can have a 99th-percentile latency in the milliseconds with ~1% of requests timing out.
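
A small worked example of how that happens, using made-up numbers (5 ms for normal requests, a 30 s timeout, just under 1% of requests affected) and a nearest-rank percentile:

```python
import math


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


TIMEOUT_MS = 30_000.0
# Assumed traffic: 991 requests answered in 5 ms, 9 requests that hit the timeout.
latencies_ms = [5.0] * 991 + [TIMEOUT_MS] * 9

p99 = nearest_rank_percentile(latencies_ms, 99)
timeout_rate = sum(1 for ms in latencies_ms if ms >= TIMEOUT_MS) / len(latencies_ms)
print(f"p99 = {p99} ms, timeout rate = {timeout_rate:.1%}")
```

The p99 stays at 5 ms because the timeouts all sit above the 99th percentile, so a dashboard that shows only p99 would look healthy.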

I wouldn’t alert on a single machine having CPU issues, but I’m definitely interested in a small collection of individual machines all having CPU issues at the same time.
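
Something like the following sketch captures that rule: ignore one hot machine, but flag several at once. The 90% threshold and the minimum of three hosts are arbitrary assumptions.

```python
def hot_hosts(cpu_by_host, threshold=0.9):
    """Hosts whose CPU utilization (0.0 to 1.0) is at or above the threshold."""
    return sorted(h for h, cpu in cpu_by_host.items() if cpu >= threshold)


def should_alert(cpu_by_host, threshold=0.9, min_hosts=3):
    """Alert only when several hosts are hot in the same sample."""
    return len(hot_hosts(cpu_by_host, threshold)) >= min_hosts


one_hot = {"h1": 0.99, "h2": 0.40, "h3": 0.35, "h4": 0.50}
several_hot = {"h1": 0.99, "h2": 0.95, "h3": 0.93, "h4": 0.50}
print(should_alert(one_hot), should_alert(several_hot))
```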

steven2012 2527 days ago

This is an incorrect statement. CPU and memory utilization matter because they limit how many other containers you can load on the same host, which makes that particular service more and more expensive to run.

madhadron 2527 days ago

> If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.

Unless it's your database.

learnfromstory 2527 days ago

If you have "the database" then you're fucked anyway and probably your thing isn't on the scale that we are discussing.

cameronbrown 2527 days ago

If you're using a sharded SQL database then a single machine going bad could still affect thousands of people.
