by kevinsundar
2527 days ago
I work at a FAANG, and host-level CPU is most definitely an alert we page on. A single host hitting 100% CPU isn't really a problem in and of itself (our SOP is just to replace the host), but it's an important early sign that other hosts may be about to become unhealthy. It might be overkill, but hey, there's mission-critical stuff at hand. For example: if you have a fleet of hosts handling jobs with retries, a bad job can end up being passed from host to host, killing or locking up each one as it goes. That can happen in minutes, while replacing, deploying, and bootstrapping a new host takes longer. So by the time your automated system detects the failure, removes the host, and spins up a new one, everything is on fire.
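To put rough numbers on that, here is a toy simulation of the retry scenario. It is only a sketch: the function name, the parameters, and the one-job, one-host-at-a-time model are all my own assumptions, not how any real fleet manager works.

```python
def simulate_poison_job(fleet_size, minutes_to_kill, minutes_to_replace, duration):
    """Return the healthy-host count at each minute (hypothetical toy model).

    Assumes a single bad job that is retried on another host every
    `minutes_to_kill` minutes, taking that host down, and that each dead
    host is replaced `minutes_to_replace` minutes after it dies.
    """
    healthy = fleet_size
    pending = []   # minutes at which replacement hosts finish bootstrapping
    timeline = []
    for minute in range(1, duration + 1):
        # Replacements that finished bootstrapping rejoin the fleet.
        healthy += sum(1 for ready_at in pending if ready_at == minute)
        pending = [ready_at for ready_at in pending if ready_at > minute]
        # The bad job takes down whichever healthy host picked it up.
        if minute % minutes_to_kill == 0 and healthy > 0:
            healthy -= 1
            pending.append(minute + minutes_to_replace)
        timeline.append(healthy)
    return timeline
```

With made-up numbers like `simulate_poison_job(5, 1, 10, 5)`, the job kills a host every minute while replacements take ten, so the fleet is empty before the first replacement arrives. Flip the ratio and the fleet barely notices, which is why the speed of detection matters more than the replacement itself.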
|
I stand by my beef with this article. The statement that "I've talked with engineers at Google [and concluded that a thing Google wouldn't tolerate is a must-have]" doesn't make sense. What I get from this article is you can talk with engineers at Google without learning anything.