by kr99x
1835 days ago
I'm "lucky" enough to deal with buggy hardware on a semi-regular basis (I start writing firmware before the hardware is finalized and run on prototypes), so I really do get bugs where the input data and the logic are all completely correct and the hardware is at fault. You get to an add instruction with immediate data/no pointers, and somehow it gives you back bad data or hangs. On the one hand, yay, not my fault!
On the other hand, HELL to debug.
On the worst hand, it dramatically increases my willingness to SAY it must be a hardware problem, which is not always the case!
1) System trying to boot would hang at seemingly random points. Could never be pinned down to a particular instruction, but could be caught doing it when stepping through with attached hardware debugger. It just wasn't consistent and never made any sense. Hang on an add. Hang on a call and never reach the first line of the thing being called. The hang would always be relatively late in the boot, but that's all that could be found.
Eventually I got it. It would hang the first time a timer interrupt triggered, which would only happen after that interrupt was enabled something like halfway into the boot.
Turns out there were disabled cores, and the system was waiting to park those cores before servicing the interrupt, but they'd never respond/ack/say "I parked" and so we'd hang.
Disable the interrupt and there was no problem.
2) Operating in Cache-As-RAM mode early in boot, no "real" memory, just the L2 cache mapped as memory. Two valid/available address ranges could not both be written to. Writing to 0xA and then 0xB, or 0xB and then 0xA, would hang the system. Data being written didn't matter. Writes didn't need to be back to back. Just couldn't play nice.
Knowing it's a hardware problem spoils the fun of trying to debug that. Bad cache: it couldn't properly convert addresses to cache lines, wrapped back on itself and panicked. Solution: move and resize the "usable" cache region to exclude the overlapping ranges.
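To make case 2 a bit more concrete, here is a toy model of how two "valid" addresses can land on the same cache line when the index calculation wraps. Everything in it is an assumption for illustration: the line size, the number of sets, the direct-mapped layout, and the names `set_index`, `ranges_alias`, and `usable_length` are all made up, and the real hardware's geometry and failure mode were surely different.

```python
# Toy direct-mapped model of Cache-As-RAM index wrap-around.
# All constants and names below are hypothetical, not the real hardware's.
LINE_SIZE = 64    # bytes per cache line (assumed)
NUM_SETS = 1024   # sets the faulty index logic actually distinguishes (assumed)


def set_index(addr):
    """Map an address to a set index; the modulo is where the wrap happens."""
    return (addr // LINE_SIZE) % NUM_SETS


def ranges_alias(a_start, a_len, b_start, b_len):
    """True if any line in range A shares a set index with a line in range B.

    Assumes both start addresses are line-aligned.
    """
    sets_a = {set_index(x) for x in range(a_start, a_start + a_len, LINE_SIZE)}
    sets_b = {set_index(x) for x in range(b_start, b_start + b_len, LINE_SIZE)}
    return bool(sets_a & sets_b)


def usable_length(base, length):
    """Longest prefix of [base, base + length) in which no set index repeats."""
    seen = set()
    for offset in range(0, length, LINE_SIZE):
        idx = set_index(base + offset)
        if idx in seen:
            return offset  # first line that wraps back onto an earlier one
        seen.add(idx)
    return length
```

With these made-up numbers the index wraps every 64 KiB, so `ranges_alias(0x0, 0x1000, 0x10000, 0x1000)` reports a collision while two adjacent 4 KiB ranges do not, and `usable_length(0x0, 0x20000)` trims a 128 KiB region down to the first 64 KiB. That is the same shape as the workaround above: shrink the region until nothing overlaps.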
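Case 1 can be sketched the same way. This is a simulation, not firmware, and not what was actually done (the fix at the time was just to disable the interrupt). It assumes a hypothetical park handshake where only enabled cores can ever ack, and it shows why an unbounded wait on every core hangs while a bounded wait, or skipping disabled cores, does not. The names `wait_for_park` and `service_timer_interrupt` and the `max_polls` limit are invented for the example.

```python
def wait_for_park(cores, max_polls=1000):
    """Poll until every core has acked the park request, or give up.

    cores: dict of core id -> {"enabled": bool, "parked": bool}
    Returns the sorted ids of cores that never acked (empty list on success).
    """
    pending = set(cores)
    for _ in range(max_polls):
        for cid in list(pending):
            core = cores[cid]
            # Assumed model: only an enabled core can respond to a park request.
            if core["enabled"]:
                core["parked"] = True
            if core["parked"]:
                pending.discard(cid)
        if not pending:
            return []
    # Real hardware with no poll limit would spin here forever.
    return sorted(pending)


def service_timer_interrupt(cores, skip_disabled):
    """Park the target cores, then report whether the interrupt got serviced."""
    targets = {
        cid: core
        for cid, core in cores.items()
        if core["enabled"] or not skip_disabled
    }
    stuck = wait_for_park(targets)
    if stuck:
        return f"hang: no ack from cores {stuck}"
    return "serviced"
```

With two enabled and two disabled cores, waiting on all four never completes because the disabled pair never acks; filtering them out first lets the interrupt go through.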