| HN Mirror


by cameronh90 1515 days ago

Fixing the true underlying problem doesn't always require expanding the scope of work, and I even suggested a few situations where it doesn't, but very often it does.

It's hard to argue the nitty-gritty without examples, so here's a real-world one from quite a long time ago, in a company that went bust after the death of the owner.

We had a system that had a significant quantity of code written in a custom language that would be compiled by an internally written compiler. This compiler was in some ways a work of genius, written in the 80s, but it had a lot of very deep architectural flaws in the optimiser that meant certain patterns of code would generate invalid output. We didn't write much new code in this language but had a pretty large body of code that needed to continue running.

So during a server hardware refresh, we found that almost everything was crashing. Turns out, a compiler optimiser flaw meant that any time a loop had a number of iterations that wasn't a multiple of the number of CPUs, generated programs would segfault.

We investigated what it would take to fix the underlying issue but it would have been a week or more of work just to understand why it was happening. Porting all the old code would have taken even longer.

Instead, using an AST manipulation library we had already written, we added a prebuild script that rewrote all of the files to include a CPU count check and then pad out the number of iterations with NOPs. It took a few hours and unblocked the server upgrade.
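To make the padding idea concrete, here is a rough Python sketch of the arithmetic only. The real script rewrote source in the custom language through the AST library, none of which is shown here; `padded_iterations` and `run_padded_loop` are hypothetical names invented for illustration.

```python
import os


def padded_iterations(n, cpus):
    """Round n up to the next multiple of cpus (n itself if already a multiple)."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return -(-n // cpus) * cpus  # ceiling division, then scale back up


def run_padded_loop(n, body, cpus=None):
    """Run body(i) for i in range(n), padding the loop with no-op iterations.

    Hypothetical stand-in for the rewritten loops: the trip count is always a
    multiple of the CPU count, but only the first n iterations do real work.
    """
    if cpus is None:
        cpus = os.cpu_count() or 1  # the "CPU count check"
    total = padded_iterations(n, cpus)
    for i in range(total):
        if i < n:
            body(i)
        # else: padding iteration, deliberately does nothing (the "NOP")
    return total
```

For example, with 4 CPUs a 10-iteration loop would be padded to 12 iterations, of which the last two do nothing.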

Another, perhaps less esoteric and more recent example:

A third-party open-source library we use had an issue where a particular function call would sometimes get stuck in an infinite loop due to incorrect network code in the library interacting badly with our network hardware.

We submitted a bug report and fix, but the maintainer wouldn't accept the fix unless we also changed a bunch of other related code, added a bunch of tests, etc., which we didn't have time to do. We considered a fork, but that would involve keeping it up to date, rebuilding packages and so on.

We worked around the issue by running it in a different process and monitoring CPU usage. If CPU usage goes beyond a certain threshold, we kill the process and try again.
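A minimal sketch of that kill-and-retry watchdog, in Python, might look like the following. Everything in it is an assumption made for illustration: the name `run_with_cpu_watchdog`, the threshold, the "several hot samples in a row" rule and the retry limit are not from the original setup. The CPU sampler is passed in as a function so the policy stays independent of how usage is actually measured.

```python
import time


def run_with_cpu_watchdog(spawn, cpu_percent, threshold=90.0, strikes=3,
                          interval=1.0, max_attempts=5, sleep=time.sleep):
    """Run a worker process, killing and retrying it if it appears to spin.

    spawn()        -> object with poll(), kill(), wait() and returncode,
                      e.g. a subprocess.Popen instance
    cpu_percent(w) -> current CPU usage of worker w (sampler is an assumption)
    The worker is killed once `strikes` consecutive samples exceed `threshold`.
    """
    for _ in range(max_attempts):
        worker = spawn()
        over = 0  # consecutive samples above the threshold
        while worker.poll() is None:  # None means still running
            sleep(interval)
            over = over + 1 if cpu_percent(worker) > threshold else 0
            if over >= strikes:
                worker.kill()
                worker.wait()  # reap the killed process
                break  # go round again with a fresh worker
        else:
            # Loop ended without a kill: the worker exited on its own.
            return worker.returncode
    raise RuntimeError("worker exceeded the CPU threshold on every attempt")
```

In practice `spawn` could be something like `lambda: subprocess.Popen([...])`, with the sampler built on a process-inspection library or on `/proc`. Requiring several consecutive hot samples is a design choice to avoid killing a worker over a short, legitimate burst of CPU.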

The workaround was quick and has been working fine for over a year now. The contributed patch is still languishing in an open PR with various +1s from other users.

1 comment

pwr-electronics 1515 days ago

I think your examples agree with my point: You found minimal-time solutions that haven't caused continuous suffering afterwards, and can be easily removed when the root cause is fixed. That's a good result.
