Hacker News
by dbaggerman 2670 days ago
It's a different approach to managing risk -- minimizing impact of failure rather than minimizing the likelihood of failure.

It's nice to know that you can kill a process and the only impact is that in-flight requests fail, rather than having a more significant outage if a process crashes and the failover doesn't work, or the process doesn't automatically restart, etc.

If you accept that requests will fail, you can build retries into the system. It's a lot harder to make a system more resilient if you avoid testing the failure scenarios.
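
To make the retry point concrete, here is a minimal sketch of a retry wrapper with exponential backoff. The names `TransientError` and `call_with_retries` and the default values are assumptions made up for illustration, not part of any particular library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Back off exponentially, with jitter so clients do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

Retrying like this is only safe when the request is idempotent, or when the server can deduplicate it. Otherwise a request that was processed just before the process died could be applied twice.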

2 comments

Exactly! Chaos engineering is all about thoughtfully planned experiments that let you observe what the user experience will be when something fails. Doing this on your own terms lets you improve that experience so your customers aren't affected.

You can decide what happens when an in-flight request is dropped: either hold onto the state somehow and retry, or have the client fail gracefully with a relevant error message.

Another thing that "normal" testing rarely catches, but chaos engineering can, is multiple things failing together in random ways. It can be surprising how badly otherwise robust services fail when several things go wrong at once.
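
As a rough illustration of testing combined failures, the sketch below enumerates every set of up to two simultaneous faults and records which sets break the system. The fault names and the `check_system` probe are hypothetical. A real experiment would inject the faults into a running service, and might sample combinations at random instead of enumerating them all.

```python
import itertools


def fault_combinations(faults, max_simultaneous=2):
    """Yield every set of 1..max_simultaneous faults to inject together."""
    for size in range(1, max_simultaneous + 1):
        yield from itertools.combinations(faults, size)


def run_experiments(faults, check_system, max_simultaneous=2):
    """Return the fault combinations under which check_system reports failure.

    check_system(active_faults) is a hypothetical probe: it should inject the
    given faults, exercise the service, and return True if users were unaffected.
    """
    return [
        combo
        for combo in fault_combinations(faults, max_simultaneous)
        if not check_system(set(combo))
    ]


# Toy system: survives any single fault, but not losing the cache and replica together.
faults = ["cache_down", "replica_down", "slow_network"]
survives = lambda active: not {"cache_down", "replica_down"} <= active
print(run_experiments(faults, survives))
```

Every single-fault experiment passes here, so only the pairwise run exposes the weakness.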