| Reminds me of a similar bug that I worked on a few years ago that led to my single-line contribution to xnu (apologies for the dissertation): We had increasing reports of devices panicking because the kernel stopped draining a buffer, causing the buffer to fill. This particular buffer should never fill, so if it does -> panic. The first problem was that this bug was getting 'hot'. The bug needed to be fixed yesterday, and with the number of internal panics being reported, it was looking like it might delay shipping the OS. I was getting pinged constantly, and was expected to give daily updates in a giant cross-org shunning, the "bug review board" or BRB. The second problem was, of course, that all the code looked fine. (Spoiler: it was. Sort of.) The relevant drivers were handling synchronization properly and appeared to be race-free, memory management looked fine, no uninitialized variables, etc. No problem, we'll just reproduce it then... The third problem was that the bug was extremely hard to reproduce. With a single device it could take weeks to hit a single occurrence. So I needed a lot of devices, and every repro had to count. At this point it was clear that I needed some USB hubs, so off to Fry's (RIP). Two giant USB hubs, one Toblerone bar, and an abundance of charity from QA later, I had ~15 devices hooked to a computer. With this battery of devices I was reproducing the issue once every few days. Reproducing the bug reliably was a breakthrough, but root-causing the bug still felt like a dim prospect. The cores from the panics showed no smoking gun (our drivers' state looked fine), and my kernel mods to add simple lockless tracing seemed to suppress the bug, in true heisenbug fashion. And of course you're never sure if it actually suppressed the bug -- maybe you just didn't wait enough days? ~6 weeks had passed, filled with BRBs, all-nighters, working weekends, and testing tons of theories, all to no avail. On a whim I decided to revisit my lockless tracing strategy and remove a memory barrier. Alas! The bug triggered and I had tracing data! Digging into the tracing data, it turned out the problem wasn't in our drivers at all, but was actually in the kernel (IOKit) itself, IOInterruptController specifically. The problem was that IOIC was setting a flag and then immediately enabling interrupts via a MMIO write. With this logic, it was possible for another core to service an interrupt (since they were just enabled via the MMIO write), but still observe the old value of the flag, because there was no barrier between setting the flag and enabling interrupts. (Hence why the barrier added by my original tracing suppressed the bug.) Because IOIC read the wrong flag value, it entered a state that prevented interrupts from being serviced, and our buffer would fill and we'd panic. The fix was to simply add a memory barrier to IOIC between setting the flag and enabling interrupts. To this day I'm still mystified as to why this bug hadn't caused broken interrupts (+ mysterious behavior) or mass panics before then. There must've been some other change to xnu that exposed the bug somehow, but I'll probably never know. |
My favourite species of software bugs are the ones that when you find the source you realise the code is fundamentally broken, and you get to investigate how the hell it worked for so long!