by weland 3973 days ago
The hardest bug I have tracked down to date was hard partly because I was a n00b at the time and partly because it was legitimately hard. It was a stack-smashing bug on an RTOS that ran on a system without an MMU. To make things a little worse, GCC support for that platform was still very early, so GDB would occasionally become confused and did not support watchpoints. On top of that, the system had grown big enough that there was no way to compile the whole thing with debug symbols and no optimizations; the image was stripped and optimized for size.

The bug wasn't easy to reproduce: all we saw was that, every once in a while, when queried over $wirelessprotocol, the system would begin answering with crap values (it was supposed to measure some physical quantities, and crap values = physically meaningless, as in negative active power and hundreds of kV on a mains line). If you kept on pounding it, it would eventually start "acting funny" -- randomly toggling LEDs and handling commands that were never given in the first place -- before eventually crashing. The symptoms were very far removed from the root cause; at first, all I was debugging was "system begins answering with thrashed values after a while".

I was two days into it when a more experienced colleague (I was a junior developer at the time) stepped in to help me. We began suspecting a process was smashing another process' stack when, after removing module after module, the bug was still not clearly reproducible by a particular sequence of steps, but the behaviour it triggered became fairly uniform.

We decided a good way to test this assumption was to modify the context switching routine to dump the current top of the stack over a serial line; unfortunately, that introduced additional delays that prevented the bug from occurring, so it didn't help us. We figured, however, that the handler for $wirelessprotocol's query was in the process that smashed the other process' stack, so we modified that handler to send the top of the stack over wireless (this is where not having an MMU helped, ironically :-) ). The base of the other process' stack could be obtained by just tracing context switches.
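
For anyone who wants to try the same trick, here is a minimal sketch of the bookkeeping involved, not the code we actually had. It assumes a descending stack, and `task_t`, `record_switch` and `stack_headroom` are hypothetical names; a real RTOS would have its own task descriptor and you would call something like this from its context-switch routine (or, as we ended up doing, from a handler that already talks to the outside world).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task descriptor; real RTOS layouts differ. */
typedef struct {
    uintptr_t stack_limit; /* lowest valid address of this task's stack */
    uintptr_t stack_base;  /* highest address; the stack grows downward */
    uintptr_t saved_sp;    /* stack pointer captured at the last switch */
    uintptr_t lowest_sp;   /* deepest stack pointer observed so far     */
} task_t;

/* Record the outgoing task's stack pointer.
 * Returns nonzero if the pointer is already below the task's own limit,
 * i.e. the task is running inside somebody else's memory. */
static int record_switch(task_t *t, uintptr_t sp)
{
    t->saved_sp = sp;
    if (sp < t->lowest_sp)
        t->lowest_sp = sp;
    return sp < t->stack_limit;
}

/* Bytes left between the deepest observed stack pointer and the limit. */
static size_t stack_headroom(const task_t *t)
{
    if (t->lowest_sp <= t->stack_limit)
        return 0;
    return (size_t)(t->lowest_sp - t->stack_limit);
}
```

Note that this only samples the stack pointer at the moments you call it, so an overflow that happens and unwinds between two samples goes unnoticed. Keeping the recorded values in RAM and reading them out later also avoids the timing change that killed our serial-dump attempt.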

Sure enough, if enough commands piled up, that process (which was running some pretty intensive stuff, including floating point operations, on a very resource-constrained system) would smash into the next one's stack, messing up its context's registers.
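
To make the failure mode concrete, here is a toy simulation of it on a hosted compiler. Everything in it is an assumption for illustration: the array `ram` stands in for flat memory without an MMU, task A's stack is placed directly above task B's, both grow downward, and B's saved registers sit at the top of B's region. The canary word is a common way to catch this kind of thing, not something we had at the time.

```c
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 16
#define CANARY      0xDEADBEEFu

/* ram[0..15] is task B's stack, ram[16..31] is task A's stack. */
static uint32_t ram[2 * STACK_WORDS];

#define A_CANARY_IDX STACK_WORDS        /* lowest word of A's stack      */
#define B_CTX_FIRST  (STACK_WORDS - 4)  /* B's four saved registers      */

static void init_memory(void)
{
    memset(ram, 0, sizeof ram);
    ram[A_CANARY_IDX] = CANARY;
    for (int i = 0; i < 4; i++)
        ram[B_CTX_FIRST + i] = 0x11111111u * (uint32_t)(i + 1);
}

/* Task A pushes `words` words, with nothing stopping it at its limit. */
static void task_a_push(int words)
{
    int sp = 2 * STACK_WORDS;           /* empty stack: one past the top */
    while (words-- > 0 && sp > 0)
        ram[--sp] = 0xAAAAAAAAu;
}

static int canary_intact(void)
{
    return ram[A_CANARY_IDX] == CANARY;
}

static int b_context_intact(void)
{
    for (int i = 0; i < 4; i++)
        if (ram[B_CTX_FIRST + i] != 0x11111111u * (uint32_t)(i + 1))
            return 0;
    return 1;
}
```

Pushing up to 15 words leaves both the canary and B's context alone. Push 17 or more and A has written through its canary into B's saved registers, so the next time B is switched in it resumes with garbage in them, and nothing faults because there is no MMU to object.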

In retrospect, this wasn't necessarily a difficult bug per se: the concept is well-understood and the theory behind it is trivial. The biggest problem is that it challenges the fundamental way we debug programs: when the CPU starts doing crap, we assume we've instructed it to do crap, and it's (correctly!) following consistently bad instructions. In this case, the CPU ended up following random instructions.