Hacker News new | ask | show | jobs
by jacquesm 1038 days ago
That's very much my experience: asymptotic reduction in bug incidence and matching increase in runtime between subsequent errors. To the point that they are so rare that you think you have fixed all of them. But that's usually an illusion: it's just that the error rate is now so low that you no longer observe incidents yourself. The only way to get past that hurdle is to do the bad thing: release and hope that you got it right and if you didn't that the incidents will not be too bad.

You could run many tests in parallel to reduce the chance but it will never be completely zero. Writing bug free software this way is hard. The better way is to design it from the ground up with a bunch of instrumentation that keeps all of your invariants under close observation and that stops the moment anything is not according to your assumptions. This usually gets you to a high level of confidence that things really do work as designed. But of course, that also isn't perfect and residual risk (and residual bugs...) will always remain in any system of even moderate complexity. File systems are well above that level, especially distributed file systems.

2 comments

I think there's room for a hybrid approach, not just fixing the bug a fuzz test found, but treating the fuzz test error as a bug in invariants (either inadequate or violated in a subtle way that's not caught) and addressing that underlying issue.
Yes, good point: often it is the assumptions themselves that were broken. So effectively you are debugging both the real system and the monitoring system.
> The better way is to design it from the ground up with a bunch of instrumentation that keeps all of your invariants under close observation and that stops the moment anything is not according to your assumptions.

Any personal "war stories" you are willing to share explaining how you went about designing such a system? :-) Or any presentation of someone who did that?

I wrote about retrofitting such a system onto an existing one:

https://jacquesmattheij.com/all-programming-is-bookkeeping/

Which was written in the context of:

https://jacquesmattheij.com/saving-a-project-and-a-company/

And without which that project likely would not have seen a successful end result.

It really is a war story, and what I like most about it is how at the drop of a hat a team assembled to get the job done, cleanly tackled the problem(s) and transferred duties to a new crew.