
by strgcmc 2120 days ago

Let's try to tie together what you're talking about (auto-scaling/capacity) with what the OP and blog post were mainly about (chaos engineering/engineering-for-failure). Imagine:

- You operate a service with significant traffic, and through empirical experience you have a good handle on what 1x traffic looks like. You have even seen spikes to 2x traffic on rare occasions, which your overall system handled just fine. Applying your overall philosophy, you set up your system to allow for 5x the CPU resources you need and call it a day; nothing to see here.

- But guess what? Unbeknownst to you, your system has some critical bottleneck that would only surface at 3x your usual traffic. It could be anything: hitting a misconfigured max limit on your load balancer, exhausting all your database connections, running out of threads or inodes on your server hosts, or triggering a kind of retry-storm/brownout because slowly increasing latency in one of your service calls explodes past a certain limit (due to some unintended interaction with your core timeout/retry logic). There are any number of latent bottlenecks like these that you never knew about, because as long as your system stayed under the critical limit, they were completely invisible to you. In other words, these are non-linear failures that you cannot solve by simply extrapolating with "1x traffic = 1x # of servers, 5x traffic = 5x # of servers".

- As a result, not only do you not have nearly as much headroom for scaling up as you think you do, but when you do encounter such a failure, you also cannot easily just "scale out" horizontally, because the failure mode itself is only exacerbated by horizontal scaling. When you encounter failures that break some axiomatic assumptions you have about your system, it can be incredibly difficult and painful to reconcile, especially if you had no plans for and no knowledge of these invisible/latent aspects of your system ahead of time.
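
To make the non-linearity concrete, here is a toy sketch in Python of one of the bottlenecks listed above (exhausted database connections). All the numbers are made up for illustration: 10 servers at 1x traffic, a pool of 20 connections per server, and a database capped at 500 connections. The function names are hypothetical, not from any real tool.

```python
import math

# Assumed, illustrative numbers; not taken from any real system.
BASE_SERVERS = 10          # servers needed to handle 1x traffic
POOL_SIZE = 20             # DB connections each server keeps open
DB_MAX_CONNECTIONS = 500   # hard cap configured on the database


def servers_needed(traffic_multiplier):
    """Naive linear capacity model: 1x traffic = 1x servers."""
    return math.ceil(BASE_SERVERS * traffic_multiplier)


def db_connections(servers):
    """Every server opens its own pool, so connections grow with fleet size."""
    return servers * POOL_SIZE


def is_healthy(traffic_multiplier):
    """True while the scaled-out fleet still fits under the DB connection cap."""
    servers = servers_needed(traffic_multiplier)
    return db_connections(servers) <= DB_MAX_CONNECTIONS


if __name__ == "__main__":
    for multiplier in (1, 2, 3, 5):
        servers = servers_needed(multiplier)
        status = "ok" if is_healthy(multiplier) else "DB connections exhausted"
        print(f"{multiplier}x traffic: {servers} servers, "
              f"{db_connections(servers)} connections -> {status}")
```

With these assumed numbers, 1x and 2x look fine, and the cap is crossed at 3x even though CPU was provisioned for 5x. Adding more servers at that point only opens more connections, which is the "scaling out makes it worse" case.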

Chaos engineering isn't about scaling at all, not really. It's about finding latent defects in your system, by actively probing your assumptions and seeing if your system behaves as you would expect. Using traffic to generate stress on the system is just one way to introduce some "chaos", but there are many other ways too (as covered in the article).
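
As a rough sketch of what "probing your assumptions" can look like at the smallest possible scale, the Python below injects timeouts into a dependency and checks whether a retry wrapper behaves the way you believe it does. `FlakyDependency` and `call_with_retries` are hypothetical names invented for this example, and real chaos experiments run against live systems rather than in-process stubs.

```python
class FlakyDependency:
    """Hypothetical wrapper that makes the first `failures` calls time out."""

    def __init__(self, func, failures):
        self.func = func
        self.remaining_failures = failures
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TimeoutError("injected fault")
        return self.func(*args)


def call_with_retries(dependency, arg, max_attempts=3):
    """Call the dependency, retrying on timeout up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return dependency(arg)
        except TimeoutError as error:
            last_error = error
    raise last_error


def double(x):
    return x * 2


if __name__ == "__main__":
    # Assumption under test: "we survive two consecutive timeouts".
    dependency = FlakyDependency(double, failures=2)
    print(call_with_retries(dependency, 21), "after", dependency.calls, "calls")

    # Probe past the assumption: a third timeout exhausts the retries.
    try:
        call_with_retries(FlakyDependency(double, failures=3), 21)
    except TimeoutError:
        print("three consecutive timeouts are not survived")
```

The point is less the retry logic itself than the habit: state the expectation, inject the fault, and see whether the system agrees with you.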

Of course, it's also true that systems need to reach a certain minimum level of complexity before the ROI of introducing chaos engineering becomes really worth it. You need a complex-enough set of services, dependencies, or interconnected components, likely enough to behave in non-obvious ways, that you have to do independent chaos engineering to test them effectively, rather than simply reasoning about their properties directly.

1 comment

fxtentacle 2120 days ago

I wholeheartedly agree with your last paragraph.

My experience is that I have yet to work with a company where this level of failure-proofing makes financial sense. Purchasing more hardware than needed is relatively cheap for most medium-sized companies, and it provides a fair level of protection against outlier accidents.

I'm aware that many people using cloud also subscribe to the 100% uptime mentality, but for most companies that is simply not needed. I mean, even for Netflix or Amazon Prime Video, I wonder if 2 hours of unexpected downtime per year would really be enough to make anyone cancel their service. I myself at least have spent much more time than that trying to get HDCP, graphics card drivers, HDMI cables, and the stars to align so that the Netflix app will work with 4K HDR playback on my TV.

So yes, regarding your 2nd paragraph, I would knowingly accept that there are critical bottlenecks that are unknown and that could be triggered by severe traffic spikes. And most of my customers would be happy to accept that risk in exchange for the cost savings of not proactively fixing the issue.

And if you look at the overall state of software, it looks like pretty much every company is happy to trade reliability/resilience for cost savings these days. That's why I applaud the efforts in the original article, but the pragmatic way seems to be to just skip the whole thing.