by learnfromstory
2527 days ago
Don't really agree that this list could have come about through discussions with engineers at Google, Facebook, etc. The more computers you have, the less important it becomes to monitor junk like CPU and memory utilization of individual machines. Host-level CPU usage alerting can't possibly be a "must-have" if there are extremely large distributed systems operating without it. If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem, and no amount of alerting can help you.
For example: if you have a fleet of hosts handling jobs with retries, a bad job can end up being passed from host to host, killing or locking up each one as it goes. That can happen in minutes, while replacing, deploying, and bootstrapping a new host takes longer. So by the time your automated system detects the failure, removes the host, and spins up a replacement, everything is on fire.
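
To put rough numbers on that race, here is a toy simulation. It is only a sketch: the function name, the parameters, and the model itself (one bad job, one host taken down per retry, a fixed replacement time) are all assumptions made for illustration, not a description of any real scheduler.

```python
def min_healthy_hosts(fleet_size, retry_interval_s, replace_time_s,
                      max_attempts=None, horizon_s=3600):
    """Lowest number of healthy hosts seen while one bad job is retried.

    Toy model (all assumptions): every attempt lands on a healthy host and
    takes it down; a downed host is back replace_time_s seconds later.
    """
    down_until = []            # recovery times of hosts that are currently down
    min_healthy = fleet_size
    attempts = 0
    t = 0
    while t <= horizon_s:
        if max_attempts is not None and attempts >= max_attempts:
            break              # retry budget exhausted, the job is dropped
        # Hosts whose replacement has finished rejoin the fleet.
        down_until = [r for r in down_until if r > t]
        healthy = fleet_size - len(down_until)
        if healthy == 0:
            return 0           # nothing left to run the job on
        # The bad job is retried on a healthy host and takes it down.
        down_until.append(t + replace_time_s)
        attempts += 1
        min_healthy = min(min_healthy, healthy - 1)
        t += retry_interval_s
    return min_healthy


# 10 hosts, a retry every 30 s, 10 minutes to replace a host.
print(min_healthy_hosts(10, 30, 600))                   # unlimited retries
print(min_healthy_hosts(10, 30, 600, max_attempts=3))   # capped retries
print(min_healthy_hosts(10, 30, 60))                    # fast replacement
```

In this model the outcome depends on how the retry interval compares with the replacement time, and on whether the retries are capped; how quickly an alert fires does not enter into it.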