by vardump 1442 days ago
Well, there's software that can cause some degree of harm, for example through servos controlling something physical. While you probably still can't catch all of the issues, you'd damn well better try as hard as you can within reason.

I'd also wish for similar rigor from people developing whatever filesystems my data is on. :-)

Fail fast is generally a good idea, if you can do it safely.

4 comments

If you can't fail safely, you better review your entire architecture.

Software fails. You can make failures rarer, but you can't make them go away. You have to deal with it; that part is not optional.

It's all really about risk management. Things can (and will) go wrong, and it doesn't only apply to software.

This involves a lot of thinking and collecting information about potential risks and evaluating their probability and severity.

Then you mitigate the worst risks first, ranked by probability times severity (other factors can be weighed in too). Some residual risk always remains.
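
To make the "probability times severity" ranking concrete, here is a minimal sketch in Python. The `Risk` class, the 1 to 10 severity scale, the example risks, and their numbers are all made up for illustration; a real assessment would use whatever scales and extra factors your process defines.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # estimated chance of occurring, 0.0 to 1.0 (assumed scale)
    severity: int       # impact, 1 (nuisance) to 10 (catastrophic) (assumed scale)

    @property
    def score(self) -> float:
        # The simple ranking from the comment: probability times severity.
        return self.probability * self.severity


def worst_risks(risks, budget):
    """Return the `budget` highest-scoring risks, worst first."""
    return sorted(risks, key=lambda r: r.score, reverse=True)[:budget]


# Hypothetical numbers, purely for illustration.
risks = [
    Risk("servo overshoots end stop", 0.05, 9),
    Risk("log file fills disk", 0.30, 1),
    Risk("sensor reads stale value", 0.10, 6),
]

# We can only afford to mitigate two; the rest is residual risk.
top = worst_risks(risks, budget=2)
for r in top:
    print(f"{r.name}: {r.score:.2f}")
```

Anything that falls outside the budget is the residual risk you have decided to live with, which is why it helps to write the list down rather than keep it in your head.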

I think the idea is that there are error recovery semantics that:

1. Determine the last sane state of the system, and work forward from there. (Read the servo position and try to go from there)

2. Have a "recovery" routine to reset the system. (Take all positions to "zero")

3. Just stop. (Yes, I know this can be bad). And ask a human for help.
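
The three options above could be tried in order, roughly like this. It is only a sketch: `FakeServo`, its methods, and the 0 to 180 degree range are stand-ins I invented so the example runs, not any real servo API.

```python
from enum import Enum, auto


class Recovery(Enum):
    RESUME = auto()  # option 1: continue from the last sane state
    RESET = auto()   # option 2: take everything back to "zero"
    HALT = auto()    # option 3: stop and ask a human


class FakeServo:
    """Stand-in for real hardware; every method here is hypothetical."""

    def __init__(self, position, readable=True, can_home=True):
        self.position = position
        self.readable = readable
        self.can_home = can_home
        self.powered = True

    def read_position(self):
        if not self.readable:
            raise OSError("position sensor not responding")
        return self.position

    def home(self):
        if not self.can_home:
            raise OSError("homing failed")
        self.position = 0.0

    def power_off(self):
        self.powered = False


def recover(servo, lo=0.0, hi=180.0):
    # 1. Last sane state: trust the reading only if it is within range.
    try:
        if lo <= servo.read_position() <= hi:
            return Recovery.RESUME
    except OSError:
        pass
    # 2. Reset: drive the servo back to its zero position.
    try:
        servo.home()
        return Recovery.RESET
    except OSError:
        pass
    # 3. Just stop: cut power and leave it for a human to sort out.
    servo.power_off()
    return Recovery.HALT
```

For example, `recover(FakeServo(90.0))` resumes, an out-of-range reading falls through to the reset, and a servo that can neither be read nor homed ends up powered off.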

If feasible, electromechanical methods are good.

> I'd also wish for similar rigor from people developing whatever filesystems my data is on. :-)

Stable storage is a key factor in making this philosophy work. [1]

[1] https://qconlondon.com/london-2012/qconlondon.com/dl/qcon-lo...

"Litter the code with aborts and test the ever-loving hell out of it" is more or less the strategy we use with flight software.