by kr99x 1835 days ago
I'm "lucky" enough to deal with buggy hardware on a semi-regular basis (I start writing firmware before the hardware is finalized and run on prototypes), so I really do get bugs where the input data and the logic are all completely correct and the hardware is at fault. You get to an add instruction with immediate data/no pointers, and somehow it gives you back bad data or hangs.

On the one hand, yay, not my fault! On the other hand, HELL to debug. On the worst hand, it dramatically increases my willingness to SAY it must be a hardware problem, which is not always the case!

2 comments

Two "fun" examples:

1) System trying to boot would hang at seemingly random points. Could never be pinned down to a particular instruction, but could be caught doing it when stepping through with attached hardware debugger. It just wasn't consistent and never made any sense. Hang on an add. Hang on a call and never reach the first line of the thing being called. The hang would always be relatively late in the boot, but that's all that could be found.

Eventually I got it. It would hang the first time a timer interrupt triggered, which would only happen after that interrupt was enabled something like halfway into the boot.

Turns out there were disabled cores, and the system was trying to park those cores before servicing the interrupt. They'd never respond/ack/say "I parked", so we'd hang.

Disable the interrupt and there was no problem.
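
For anyone who hasn't hit this kind of hang, here is a toy model of the park handshake in Python. It is only a sketch: the comment doesn't say where the wait actually lived (hardware, microcode, or firmware), and the names `wait_for_park`, `read_ack`, and `max_polls` and the 4-core masks are all made up for illustration. The point is that waiting on an ack from a core that can never answer only ends if the wait is bounded, or if the disabled cores are left out of the request.

```python
def wait_for_park(request_mask, read_ack, max_polls):
    """Poll until every core in request_mask has acked a park request.

    request_mask: bitmask of cores asked to park (hypothetical layout, bit n = core n).
    read_ack:     callable returning a bitmask of cores that currently report "parked".
    max_polls:    upper bound on polls; the real system effectively had no bound.

    Returns the bitmask of cores that never acked (0 means everyone parked).
    """
    acked = 0
    for _ in range(max_polls):
        acked |= read_ack() & request_mask
        if acked == request_mask:
            return 0
    return request_mask & ~acked


# Assumed setup: 4 cores present, cores 2 and 3 disabled, so only 0 and 1 ever ack.
PRESENT = 0b1111
ENABLED = 0b0011


def fake_ack():
    return ENABLED  # disabled cores stay silent forever


# Asking every present core to park: cores 2 and 3 never answer.
stuck = wait_for_park(PRESENT, fake_ack, max_polls=1000)
print(f"never acked: {stuck:#06b}")

# Asking only the enabled cores: the wait finishes on the first poll.
print(f"never acked: {wait_for_park(PRESENT & ENABLED, fake_ack, 1000):#06b}")
```

With an unbounded loop, the first call is the hang from the story. This is not the fix that was used (that was disabling the interrupt), just a picture of why it hung.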

2) Operating in Cache-As-RAM mode early in boot, no "real" memory, just the L2 cache mapped as memory. Two valid/available address ranges could not both be written to. Writing to 0xA and then 0xB, or 0xB and then 0xA, would hang the system. Data being written didn't matter. Writes didn't need to be back to back. Just couldn't play nice.

Knowing it's a hardware problem spoils the fun of trying to debug that. Bad cache, couldn't properly convert addresses to cache lines, wrapped back on itself and panicked. Solution - move and resize "usable" cache region to exclude the overlapping ranges.
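
To show what "wrapped back on itself" looks like in index terms, here is a toy Python model. Everything in it is assumed for illustration: a direct-mapped cache (a real L2 is set-associative), 64-byte lines, 1024 lines, and invented address ranges. The real bug was the hardware doing the address-to-line conversion wrong. This only shows how two ranges that both look valid can land on the same lines, and how you would find the overlap to carve out of the usable region.

```python
LINE_SIZE = 64    # bytes per cache line (assumed)
NUM_LINES = 1024  # lines in the toy cache (assumed), so 64 KiB total


def line_index(addr):
    """Cache line an address lands on in the toy direct-mapped model."""
    return (addr // LINE_SIZE) % NUM_LINES


def lines_for_range(base, size):
    """Set of cache line indices touched by [base, base + size)."""
    if size <= 0:
        raise ValueError("size must be positive")
    first = base // LINE_SIZE
    last = (base + size - 1) // LINE_SIZE
    return {n % NUM_LINES for n in range(first, last + 1)}


def aliased_lines(range_a, range_b):
    """Line indices that two (base, size) ranges both map onto."""
    return lines_for_range(*range_a) & lines_for_range(*range_b)


# Hypothetical ranges: B sits exactly one cache size above A, so it wraps onto A.
range_a = (0x00000, 0x1000)
range_b = (0x10000, 0x1000)
range_c = (0x01000, 0x1000)

print(len(aliased_lines(range_a, range_b)))  # A and B collide on every line of A
print(len(aliased_lines(range_a, range_c)))  # A and C do not share any line
```

With no real memory behind the cache, two addresses on the same line have nowhere to evict to, which is why shrinking the "usable" region to skip the colliding ranges works as a workaround.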

Bus timing errors! Fun times!

Forgot a wait state? It'll probably work, on most chips!

Even better when suppliers fix, or add, bugs and don't tell you. Or change the firmware they are shipping on a part that's hanging off a UART. Or how about discovering that in the 21st century, one of your suppliers doesn't use source control for their firmware and every time they send you over a firmware blob it consists of some patches applied to whatever code happened to be lying around on some developer's machine!