by hyperman1 1321 days ago

I agree with part of this. When possible, code should either succeed or revert to the state before the operation before it throws. Avoid half-completed work if at all possible. In general this is quite easy: first do the work with the higher chance of failure, and only then connect that work to the rest of the system state. E.g. don't add an item to a list and then do something with it; do it the other way round, and only add items to the list once the work on them has already been done.
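A minimal sketch of that ordering in Python. The names `add_processed` and `parse_positive` are made up for illustration; the point is only that the step that can fail runs before anything touches the shared list.

```python
def parse_positive(text):
    """Hypothetical risky step: raises ValueError on bad input."""
    value = int(text)  # may raise ValueError
    if value <= 0:
        raise ValueError(f"not positive: {value}")
    return value


def add_processed(items, raw, process):
    """Do the failure-prone work first, then attach the result to shared state."""
    result = process(raw)   # risky: if this raises, items is untouched
    items.append(result)    # commit: only reached after the work succeeded
    return result


items = [1, 2]
add_processed(items, "3", parse_positive)

try:
    add_processed(items, "oops", parse_positive)
except ValueError:
    pass  # nothing to roll back, the list never saw the bad item
```

After the failed call the list holds exactly what it held after the last successful one, so the caller has no half-completed state to clean up.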

If you are in a known failure scenario, you can try known resolutions. If some necessary resource like a DB or file system disappeared, try to self-heal as soon as the resource reappears. But if you're in unknown territory, better stop working and complain. Writing a message in a log is good enough, IF some monitoring system will pick up that log and alert the right channel.
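Roughly what I mean, as a sketch. `ResourceUnavailable` and `run_with_recovery` are hypothetical names, and I'm assuming the code can tell a known failure (resource gone) apart from everything else by exception type, and that something is watching the log.

```python
import logging
import time

log = logging.getLogger("worker")


class ResourceUnavailable(Exception):
    """Hypothetical marker for a known failure, e.g. the DB disappeared."""


def run_with_recovery(operation, retries=3, delay=1.0, sleep=time.sleep):
    """Retry known failures; on anything unknown, log loudly and stop."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except ResourceUnavailable as exc:
            # Known scenario: wait and try again, the known resolution.
            log.warning("resource unavailable (attempt %d/%d): %s",
                        attempt, retries, exc)
            if attempt == retries:
                raise  # still gone after all attempts: give up
            sleep(delay)
        except Exception:
            # Unknown territory: no self-healing, just complain and stop.
            log.critical("unknown failure, stopping", exc_info=True)
            raise


calls = {"n": 0}

def flaky():
    """Fails twice with the known error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ResourceUnavailable("db down")
    return "ok"

result = run_with_recovery(flaky, retries=5, delay=0)
```

The critical log line only helps if monitoring actually picks it up and alerts someone; the code above just assumes that part exists.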

I've seen systems try to self-heal while working on wrong assumptions. Operators then have 2 problems: fix the system AND stop it from damaging itself. Self-healing can turn into self-damaging quickly.