| HN Mirror



	by bd_at_rivenhill 4062 days ago
	I believe 100% in proactive error prevention as a means of building robust programs but, as you say, some failures cannot be handled this way: when the kernel fails (as in this case) or the hardware fails (e.g. a spontaneous bit flip in memory when not using ECC/ECM), the program itself has no way to handle it. Given that, you must also make the system robust, and one element of that is being disciplined about communicating program state via mechanisms such as heartbeats. This is not a magic bullet, but it produces much better results than a lackadaisical approach. I think at least as much effort should be spent on system robustness as on program robustness, because much of the error-handling code I've seen at the program-logic level is overly complicated and under-tested. When in doubt, call abort() and let the overall system sort things out (and design your system so that this approach works; check out Netflix's Chaos Monkey).
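	To make the heartbeat-plus-abort idea concrete, here is a minimal sketch in Python, not a definitive implementation. The names HeartbeatMonitor, beat, stale_workers and check_invariant are hypothetical and chosen only for illustration; os.abort() is Python's counterpart to C's abort(). It assumes a supervisor process records a timestamp each time a worker reports in and treats any worker that has been silent longer than a timeout as dead.

```python
import os
import time


class HeartbeatMonitor:
    """Supervisor-side record of when each worker last reported in (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # silence longer than this counts as failure
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last_seen = {}         # worker id -> timestamp of most recent heartbeat

    def beat(self, worker_id):
        """Record a heartbeat from the given worker."""
        self.last_seen[worker_id] = self.clock()

    def stale_workers(self):
        """Return the ids of workers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def check_invariant(condition, message):
    """Worker-side check: on a violated invariant, die loudly rather than
    run a complicated, under-tested recovery path."""
    if not condition:
        os.write(2, (message + "\n").encode())
        os.abort()  # raises SIGABRT; the supervisor notices the missing heartbeats
```

	The supervisor would poll stale_workers() periodically and restart whatever it returns. That only works if the rest of the system tolerates a worker vanishing at any moment, which is the design property the comment says to build in and exercise.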