by jandrewrogers 1321 days ago

I’ve written a few systems that were aggressively self-healing and operated them in production. The benefit is as you say. When done well, the systems kind of run themselves and require much less attention than systems that are not designed this way. From an operations perspective it was great. From a software development perspective, not so much, and this largely explains why it is uncommon.

In all typical software architectures, many places in the code do not have enough context to handle exceptional conditions. A single error may have multiple possible root causes, which the code has to determine by inference or deduction so that the handling is appropriate. Evaluating some causes requires complex code far outside the purview of the software’s main purpose, and possibly outside the skill set of the developers. Appropriate resolution of an exception at a single call site can be context dependent: not only do you have to determine the root cause at runtime, you also have to determine the correct resolution at runtime. A single resolution may need to implement multiple strategies to account for real-time environmental context that changes how that resolution is handled.
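
To make that concrete, here is a rough sketch of the two runtime decisions (diagnose the cause, then pick a resolution that depends on context). Everything in it is made up for illustration: the `Cause` values, the `Context` fields, and the strategy names are assumptions, not taken from any real system.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Cause(Enum):
    DISK_FULL = auto()
    QUOTA_EXCEEDED = auto()
    UNKNOWN = auto()


@dataclass
class Context:
    """Hypothetical runtime facts gathered when the error occurs."""
    free_bytes: int
    quota_bytes_left: int
    under_heavy_load: bool


def diagnose(err: OSError, ctx: Context) -> Cause:
    """Infer the root cause of a failed write from the environment.

    The error object alone is ambiguous here, so the decision rests on ctx.
    """
    if ctx.free_bytes == 0:
        return Cause.DISK_FULL
    if ctx.quota_bytes_left == 0:
        return Cause.QUOTA_EXCEEDED
    return Cause.UNKNOWN


def resolve(cause: Cause, ctx: Context) -> str:
    """Pick a resolution; the same cause can map to different strategies."""
    if cause is Cause.DISK_FULL:
        # Assumed policy: compaction is expensive, so defer it under load.
        return "shed_writes" if ctx.under_heavy_load else "compact_and_retry"
    if cause is Cause.QUOTA_EXCEEDED:
        return "redirect_to_other_volume"
    return "escalate_to_operator"
```

The same `OSError` ends up with a different resolution depending on whether the system is busy, and a real `diagnose` would need to probe the environment rather than read three fields, which is where the complexity tends to pile up.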

Making this logic maintainable requires an architecture that revolves pretty heavily around the software infrastructure needed to make this type of exception handling scalable. You’re replacing all of the error handling idioms every software engineer knows with something alien that colors the entire code base. Also, there are few robust frameworks that do this grunt work for you, so you are usually left writing your own.
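
As a toy illustration of what that infrastructure might look like, here is a minimal sketch assuming a central registry of recovery routines and a decorator that routes failures through it. The names `RECOVERIES`, `recovers`, and `self_healing` are hypothetical, and a production version would need far more than this (diagnosis, backoff, escalation).

```python
import functools
from typing import Callable, Dict, Type

# Hypothetical central registry: exception type -> recovery routine.
# A recovery routine returns True when the operation is worth retrying.
RECOVERIES: Dict[Type[Exception], Callable[[Exception], bool]] = {}


def recovers(exc_type):
    """Register a recovery routine for one exact exception type."""
    def register(fn):
        RECOVERIES[exc_type] = fn
        return fn
    return register


def self_healing(max_attempts: int = 3):
    """Wrap a function so its failures are routed through the registry."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    recover = RECOVERIES.get(type(exc))
                    # Re-raise when nothing is registered, the attempt
                    # budget is spent, or the recovery routine gives up.
                    if (recover is None or attempt == max_attempts
                            or not recover(exc)):
                        raise
        return wrapper
    return decorate


# Example wiring with made-up state standing in for a real connection.
state = {"connected": False, "reconnects": 0}


@recovers(ConnectionError)
def reconnect(exc):
    state["connected"] = True
    state["reconnects"] += 1
    return True


@self_healing()
def fetch():
    if not state["connected"]:
        raise ConnectionError("link down")
    return "payload"
```

Even in this small form, every call site that wants healing has to be wrapped and every failure mode needs a registered routine, which hints at how the approach spreads through a code base.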

The tl;dr is that implementation is quite expensive and difficult in practice, even though it usually has no performance overhead and is great from an operations perspective. While people like the idea, the software development overhead is usually considered too high to justify making operations’ life easier.