Hacker News new | ask | show | jobs
by mynameisvlad 1718 days ago
I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.

To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.

If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.

To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.

What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.

But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.

1 comments

The issue is that this view always relies on stuff like "make people triple check everything".

- How does that relate to making a config change?

- How do you practically implement a system where someone has to triple check everything they do?

- How do you stop them just clicking 'confirm' three times?

- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?

I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.

And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?

This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.

(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)