by notacoward
2237 days ago
Here's the craziest one that actually happened to me. The company I worked for had installed what's best described as a mini-supercomputer (though we avoided the term) at a site in Boulder. We started getting reports of failures on the internal communication links between the compute nodes ... only at high load, late in the day. Since I was responsible for the software that managed those links, I got sent out. Two days in a row, after trying everything we could to reproduce or debug the problem, I got paged minutes after I'd left (and couldn't get back in) to tell me that it had failed again.

Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well-known problem in that area, having caused multi-month delays for some of the larger supercomputer installations there. But we'd already corrected for that. It wasn't the problem.

What it ultimately turned out to be was airflow and cooling. The air's thinner up there, so it carries less heat. But it wasn't the processors or links that were getting too hot. It was the power supply. When a power supply gets warmer it gets less efficient. Earlier in the day, or with the shorter runs we did as we tried different things, this wasn't enough to cause a problem. With it being warmer later in the day, continuous load for longer periods was enough to cause slight brown-outs, and those were making our links flaky. And of course it would always restart just fine because it had cooled down a bit. The fix ended up being one line in a fan-controller config.
|
It's a long story, but the gist is that after multiple board swaps, and after realizing we'd isolated the panel as the fault, I noticed the goo and on a hunch checked it with a scintillator, deducing it was alpha when cardboard blocked it. Turns out the ultra-precious-metal IBM heat sink on the board had an open path that effectively channeled the alpha particles into one of those multi-chip carrier thingies, which featured exposed chips.
As for why I had a scintillator lounging in my desk at a portfolio management company, don't ask. Let's just note the iconic IT anti-hero of that era was the Bastard Operator From Hell, and leave it at that.