by GavinB 5142 days ago
This may be a tortured analogy, but it boils down to a basic problem:

1. You know there's a bug

2. You can't reproduce it

Several next steps come to mind:

1. Hire an outside expert who's dealt with this sort of thing before. They may be able to theorize what's going on and come up with a solution.

2. Install measures that don't prevent the problem but limit the damage. For example, an emergency failsafe that shuts down the system or relieves the pressure when the incident occurs. This is why electrical systems have fuse boxes! Error management is sometimes the only option, because 100% error prevention is impossible.

3. Install monitoring that tracks a lot more detail than you are currently getting. When the next error occurs, you will know a lot more and may have the information needed.
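
For software, the failsafe in step 2 is often built as a circuit breaker. The following is only a minimal sketch under assumed names (`CircuitBreaker`, `max_failures`, `reset_after` are all hypothetical): after a run of consecutive failures it refuses further calls for a cool-down period, much like a fuse cutting power.

```python
import time


class CircuitBreaker:
    """Hypothetical failsafe: stop calling an operation after repeated errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # cool-down in seconds once tripped
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Tripped: refuse the call instead of risking more damage.
                raise RuntimeError("circuit open: call skipped")
            # Cool-down elapsed: close the breaker and start counting again.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success resets the consecutive-failure count.
        self.failures = 0
        return result
```

This does nothing to find the bug. It only bounds how much harm repeated failures can do while you keep looking.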

Edit: What's the name of the theory in networking that 100% error prevention is not possible, so error handling is the only option? There was a great article on HN about it a few years back.


My experience has almost always led to #3 being the most workable solution, but not a perfect one.* #2 should be incorporated into any project, but it presumes that you know all possible ramifications of incorrect operation. An electrical breaker works because complete non-operation is generally better than death. For many software companies, complete non-operation is a precursor to death.

#1 is almost never a good solution, for two reasons. First, the time it would take an outside expert to become familiar enough with the codebase to not aggravate your existing engineers would exceed several iterations of #3. Second, I've rarely met an outside expert whose solutions didn't involve rewriting everything to meet their expectations of a "correct implementation." This could be a sample selection problem on my part, however.

* - How do you know that you are monitoring the correct component? This path usually leads to multiple monitoring development tasks: you find that where you thought the problem originated was in fact a symptom, and you keep adding monitoring as you get closer to the source. This is why I almost always add an insane level of logging to any application, and control the verbosity through runtime controls.
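
As a rough illustration of heavy logging with a runtime verbosity control, here is a small sketch using Python's standard `logging` module. The logger name `myapp`, the `set_verbosity` helper, and `handle_request` are assumptions made for the example. How the control is exposed (signal, admin endpoint, config reload) is left open.

```python
import logging

log = logging.getLogger("myapp")  # hypothetical logger name

# Names an operator could send at runtime, mapped to stdlib levels.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def set_verbosity(level_name):
    """Change log verbosity while the process is running."""
    try:
        level = LEVELS[level_name.lower()]
    except KeyError:
        raise ValueError(f"unknown log level: {level_name!r}") from None
    log.setLevel(level)


def handle_request(payload):
    """Example code path with debug logging that is normally silent."""
    log.debug("payload received: %r", payload)
    result = len(payload)
    log.debug("computed result: %d", result)
    return result
```

The debug calls stay in the code permanently. In normal operation you run at `warning`, and when the unreproducible bug shows up you call `set_verbosity("debug")` without restarting.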

Brief non-operation (reboot / service restart) is often better than a prolonged outage. Particularly where SLAs are set to create an expectation and acceptance of this, and where redundancy exists.

I'm thinking too that there's a feedback process at work here, and some sort of damping mechanism would help with that.

Agreed, and many architectures are designed to have components "transparently fail" without impact to overall operation. When you have forced failures, feedback/damping is absolutely required. However, in my experience most such failures are unplanned and unknowable at the outset, and you can only dampen conditions which are predictable.
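
One common form of damping for forced restarts is a capped exponential backoff, assuming the feedback loop in question is a component restarting over and over. A minimal sketch (the function name and defaults are made up for illustration):

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield the wait before each successive restart, growing up to `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        # Grow the delay geometrically, but never past the cap.
        delay = min(delay * factor, cap)
```

A supervisor would sleep for the next yielded delay before each restart and create a fresh generator once the component has stayed healthy for a while. This only damps the restart loop itself, which is one of the predictable conditions. It says nothing about failures you did not anticipate.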