by AnimalMuppet 2243 days ago
I've posted this story before, but it fits here rather nicely.

I had a function that looked like this:

  void f() {
    bool flag = true;
    while (flag) {
      g();
    }
  }
This function would sometimes exit. But that's really all there was to the function. Somehow flag was becoming false, even though nothing ever wrote to it.

When a variable is mysteriously changing, you might suspect g() of smashing the stack. But then you'd expect the return address to get overwritten too, and it wasn't - the function returned from g() to f(), found flag to be false, exited the loop, and returned from f().

Eventually I got desperate enough to look at the assembly code produced by the compiler, and I became enlightened. (This was g++ on an ARM, by the way.) flag was being stored in R11, not in memory. (Might have been R12 - it's been a while.) When g() was called, f() just pushed the return address. Then g() pushed R11, because it was going to have its own variable to stash there, and then created space for its stack variables. And one of those variables was smashing the stack by 4 bytes, overwriting the saved flag value from f().
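
To make the mechanism concrete, here is a rough model of it, not the original code. It treats g()'s stack frame as a flat byte array: a receive buffer followed directly by the slot holding the saved R11. The 16-byte buffer size, the FrameModel type, and the run_g name are all made up for illustration, and the real frame layout is whatever the compiler chose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sizes: a 16-byte receive buffer, then the 4-byte slot
// where g() saved the caller's R11 (which holds f()'s flag).
constexpr std::size_t kBufferSize = 16;
constexpr std::size_t kSavedRegSize = sizeof(std::uint32_t);

// g()'s frame modeled as one flat byte array, so that the "overflow"
// below stays inside a single object and is well-defined C++.
struct FrameModel {
    unsigned char bytes[kBufferSize + kSavedRegSize];
};

// Simulates one call to g(): save R11, receive incoming_len bytes into
// the buffer, restore R11, and return the value the caller gets back.
std::uint32_t run_g(std::uint32_t r11, const unsigned char* incoming,
                    std::size_t incoming_len) {
    FrameModel frame{};
    assert(incoming_len <= sizeof(frame.bytes));  // keep the model in bounds

    // "push R11": the caller's flag now lives in g()'s frame.
    std::memcpy(frame.bytes + kBufferSize, &r11, kSavedRegSize);

    // The receive call: anything past kBufferSize lands on the saved R11.
    std::memcpy(frame.bytes, incoming, incoming_len);

    // "pop R11": the caller sees whatever is in the slot now.
    std::uint32_t restored = 0;
    std::memcpy(&restored, frame.bytes + kBufferSize, kSavedRegSize);
    return restored;
}
```

In this model, run_g(1, msg, 16) hands back 1, so the loop keeps going. run_g(1, msg, 20) hands back whatever the last four bytes of msg were, so the flag only reads as false when those four bytes are all zero.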

Worse, the way the stack was getting smashed was on a call to mesgrecv(). This takes a pointer to a structure and a size, but the relationship between the two isn't what you'd expect. The size isn't the size of the structure, but rather the size of a substructure within that structure. A contractor had gotten that detail wrong when they used that mechanism for IPC between two chips. (They'd gotten it wrong on the sending side, too, so the data stayed in sync.)
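
As a sketch of how that size mix-up produces a 4-byte overrun: the Message struct below is a hypothetical layout, not the real one. It also assumes the receive call writes the header plus "size" bytes of payload, which is my reading of the description rather than something taken from the man page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical message layout: a 4-byte header followed by the payload.
struct Message {
    std::int32_t type;  // header written by the receive call itself
    char payload[16];   // the substructure that the size is supposed to describe
};

// What the call expects: the size of the payload substructure only.
constexpr std::size_t correct_size() { return sizeof(Message::payload); }

// The mistake: passing the size of the whole structure.
constexpr std::size_t mistaken_size() { return sizeof(Message); }

// If the call writes the header plus "size" payload bytes, the mistaken
// size runs past the end of the structure by exactly the header size.
constexpr std::size_t overrun_bytes() {
    return mistaken_size() - correct_size();
}

static_assert(overrun_bytes() == sizeof(std::int32_t),
              "overrun equals the header size in this layout");
```

With this layout the overrun is 4 bytes, which is the right amount to clobber one saved 32-bit register sitting just past the structure.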

The net result was that the flag got cleared when four next-door-but-unrelated bytes on another CPU were all zero. It took me a month, off and on, to figure that out.


Crazy thing to go with that... if you compiled with different (more aggressive) optimization flags, it might have gone away!
It already went away when I tried to print out the address of the variable, so that I could watch it in the debugger (because, in order to take the address of it, it had to become a stack variable).
In the end, do you remember what tools you used to confirm that R11 was overwritten? The tools and the path to the root cause are also quite interesting.
I first looked at R11 because of the assembly output. There are flags that you can give g++ to produce the assembly output when it compiles. That showed me that the variable was in R11, and where it wound up on the stack in the g() function body.

From there, it was a question of how g() was smashing the stack. (I hadn't looked at that before, because I assumed that it had to be f() smashing the stack in order to change the variable.) Well, the next thing on the stack was the structure for mesgrecv. If too much got read into it, it would overwrite the stored copy of R11. That led me to look very carefully at the mesgrecv call. Checking the parameters against the man page showed up the unexpected (at least to me) requirements for the size parameter.

I never "verified" that the stored copy of R11 was being overwritten, except by changing the size parameter and noting that the loop in f() no longer terminated.