| Hi! Author of the article here. The core concern is not about the capabilities of the compute abstraction being used (bare metal, containers or functions) or testing OS capabilities. The aim is to validate the mitigations that are in place to counter turbulent scenarios (for example, a massive spike in traffic, a network outage, a dependency being down, etc.). These scenarios generally originate outside the given system. These kinds of questions should be asked and systematically validated (quoting the article): * Have you tested how the system behaves when the underlying instances have a sustained CPU spike? * Is the system behavior understood under different stress? * Is there sufficient monitoring? * Have the alarms been validated? * Are there any countermeasures implemented? For example, is auto-scaling set up, and does it behave as expected? Are timeouts and retries appropriate? |
"Have you tested how the system behaves when the underlying instances have a sustained CPU spike?"
Since dedicated boxes are cheap, I'd just buy 5x the CPU resources that I reasonably need and call it a day. If there ever is a traffic spike of more than 5x, Docker's resource limits will keep the overloaded service from becoming a noisy neighbor, so only the affected services will become slower than usual. And even a 10x traffic multiplier would produce only about a 2x slowdown, which should be tolerable for most users.
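To spell out the arithmetic behind that 2x figure, here is a rough sketch. It assumes a deliberately naive model: the work is CPU-bound, the CPU is shared fairly, and response time stays flat until the spare capacity is used up, then grows linearly with load. Real systems usually degrade worse than this near saturation because of queuing, so treat the numbers as a lower bound. The function name `slowdown_factor` is made up for illustration.

```python
def slowdown_factor(traffic_multiplier: float, capacity_multiplier: float) -> float:
    """Estimate how much slower requests get under a traffic spike.

    Naive assumption: no slowdown while load fits into the provisioned
    capacity, and a slowdown proportional to the overload beyond that.
    """
    if traffic_multiplier <= 0 or capacity_multiplier <= 0:
        raise ValueError("multipliers must be positive")
    return max(1.0, traffic_multiplier / capacity_multiplier)


if __name__ == "__main__":
    # 5x over-provisioning, as in the comment above.
    for traffic in (1, 5, 10, 20):
        factor = slowdown_factor(traffic, capacity_multiplier=5)
        print(f"{traffic}x traffic -> {factor:.1f}x slowdown")
```

With 5x capacity, anything up to a 5x spike costs nothing in this model, and a 10x spike doubles response times.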
I agree that on clouds you want to save costs by only booking what you need. But on bare metal, you can usually afford to keep spare capacity around all the time.
As such, I wouldn't plan for the system to behave well under stress. I'd try to always have enough resources around so that stress never happens. At the end of the day, this seems like a developer time vs. resource costs trade-off, and for most companies developers are scarce and resources are plentiful, so they'll face a very different trade-off from big FAANG companies.
"For example, is auto-scaling set up, and does it behave as expected?"
If your system is usually 90% idle, I wonder if you'll ever need that auto-scaling. Also, I'd say my customers can endure it if page load time goes up from 100ms to 200ms. So in my opinion, there is little need for auto-scaling for most companies.
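As a back-of-the-envelope check of that claim, the question can be turned around: given how idle the system normally is and how much extra latency customers will accept, how large a spike can it absorb without auto-scaling? This is only a sketch under the same naive assumption that latency grows linearly once the idle capacity is exhausted. The function name and parameters are hypothetical.

```python
def max_traffic_multiplier(
    idle_fraction: float,
    base_latency_ms: float,
    latency_budget_ms: float,
) -> float:
    """Largest traffic multiplier that keeps latency within the budget.

    Naive assumption: latency is unchanged while idle capacity remains,
    then scales linearly with the overload.
    """
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    if base_latency_ms <= 0 or latency_budget_ms < base_latency_ms:
        raise ValueError("budget must be at least the base latency")
    # A system that is 90% idle can take roughly 10x its usual load.
    headroom = 1.0 / (1.0 - idle_fraction)
    # Accepting slower responses stretches that headroom further.
    return headroom * (latency_budget_ms / base_latency_ms)


if __name__ == "__main__":
    limit = max_traffic_multiplier(0.9, base_latency_ms=100, latency_budget_ms=200)
    print(f"tolerable spike: about {limit:.0f}x normal traffic")
```

Under these assumptions, a system that is 90% idle and may go from 100ms to 200ms tolerates a spike of roughly 20x before anyone would miss auto-scaling.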