| > The bug was triggered when your drive went into a very common thermal recalibration mode

As the person responsible (alas), my specific recollection of this particular bug is that the root cause wasn't thermal recalibration, but rather UDMA signalling errors. Prior versions of Coherent using PIO mode had excruciatingly slow access, and when adding support for UDMA I also added support for the disk driver to recognise sequential access and issue multisector transfer requests; this boosted performance fairly massively, something like 3-4 times for some common things, and it was run for a fairly long time in-house and by beta testers with no trouble before it shipped.

The problem, though, was a small - literally one line - arithmetic error that bit when the drive end of things reported that a UDMA transfer error had occurred in the middle of a multisector operation; the error-handling code that set up a retry of the operation didn't compute the start kernel address correctly when a whole bunch of transfers had been merged (and some subset had worked).

The primary problem with the UDMA modes was sensitivity to correct cable termination - see https://en.wikipedia.org/wiki/Parallel_ATA#Cable_select for some of that; basically, signal reflections from parallel ATA cable runs that didn't have terminating resistors made things electrically marginal, and some systems would have really excessive numbers of UDMA CRC faults as a consequence. Given sufficiently high error rates and really bad timing, that could end up polluting the buffer cache with stuff that was skewed by a sector :-(

The big thing (on top of not having any in-house hardware that triggered this specific bug) was the sheer volume of work required for those releases, since getting from what was basically a fairly vanilla Seventh Edition UNIX to where it needed to be to start running large pieces of third-party code expecting POSIX was a big lift.
Since there weren't many people, everyone had to wear lots of hats; for instance, aside from kernel work I did a huge amount of work for POSIX.1 and .2 compatibility, and on top of doing the underlying code changes (which ranged all over the system, particularly for some of the stuff we ran into Autotools scripts relying on), all of those needed documenting, too. [ Fred Butzen did amazing work writing the actual manpage text and making it really easy to understand - he justly deserved the credit for the quality of the manual in terms of its readability. But the scale of the changes needed to bring so many parts and pieces from V7 to POSIX meant lots and lots and lots of work trying to iterate over docs for technical accuracy at the same time as having to redesign all the affected parts and pieces. It was, in a word, exhausting. ] |
PC hardware was all over the map in those days. I don't remember this bug being tied to cabling, but I only worked it to the point where we recognized that the cause was not handling an error in multi-sector transfers correctly. I do remember putting scatter/gather handling into the SCSI driver so that SCSI drives could do the same multi-sector trick. I also dimly remember that Louis Gilberto had to patch my driver for a bug afterwards, and Hal said that he didn't have kind words for me.
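
For anyone unfamiliar with this class of bug, here is a minimal sketch of the retry arithmetic being discussed. It is not the Coherent driver code; `struct xfer`, `retry_after_partial`, and the 512-byte `SECTOR_SIZE` are all hypothetical names and assumptions chosen for illustration. The point is only that when a merged multisector request fails partway, the retry has to advance the starting sector and the kernel buffer address by the same number of completed sectors; if the two drift apart, data lands in the buffer skewed from where it belongs.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u /* assumption: 512-byte sectors */

/* Hypothetical descriptor for one merged multisector request. */
struct xfer {
    uint32_t start_lba;  /* first sector of the request */
    uint32_t nsectors;   /* number of sectors still to transfer */
    unsigned char *buf;  /* kernel address where the first sector goes */
};

/*
 * Build the retry request after `done` sectors of `req` completed
 * successfully and the next one reported an error.
 * The sector number and the buffer address must move together.
 */
static struct xfer retry_after_partial(const struct xfer *req, uint32_t done)
{
    struct xfer r;

    if (done > req->nsectors)
        done = req->nsectors; /* defensive clamp for this sketch */

    r.start_lba = req->start_lba + done;
    r.nsectors  = req->nsectors - done;
    r.buf       = req->buf + (size_t)done * SECTOR_SIZE;
    return r;
}
```

With an 8-sector request starting at sector 100 where 3 sectors went through before the error, the retry covers 5 sectors starting at sector 103, with the buffer pointer advanced by 3 * 512 bytes. Getting just the buffer term wrong by one sector would leave everything after the retry point off by a sector in memory.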