| HN Mirror


by throwawaybyeeee 1557 days ago

Reminds me of a similar bug that I worked on a few years ago that led to my single-line contribution to xnu (apologies for the dissertation):

We had increasing reports of devices panicking because the kernel stopped draining a buffer, causing the buffer to fill. This particular buffer should never fill, so if it does -> panic.

The first problem was that this bug was getting 'hot'. The bug needed to be fixed yesterday, and with the number of internal panics being reported, it was looking like it might delay shipping the OS. I was getting pinged constantly, and was expected to give daily updates in a giant cross-org shunning, the "bug review board" or BRB.

The second problem was, of course, that all the code looked fine. (Spoiler: it was. Sort of.) The relevant drivers were handling synchronization properly and appeared to be race-free, memory management looked fine, no uninitialized variables, etc. No problem, we'll just reproduce it then...

The third problem was that the bug was extremely hard to reproduce. With a single device it could take weeks to hit a single occurrence. So I needed a lot of devices, and every repro had to count.

At this point it was clear that I needed some USB hubs, so off to Fry's (RIP). Two giant USB hubs, one Toblerone bar, and an abundance of charity from QA later, I had ~15 devices hooked to a computer. With this battery of devices I was reproducing the issue once every few days.

Reproducing the bug reliably was a breakthrough, but root-causing the bug still felt like a dim prospect. The cores from the panics showed no smoking gun (our drivers' state looked fine), and my kernel mods to add simple lockless tracing seemed to suppress the bug, in true heisenbug fashion. And of course you're never sure if it actually suppressed the bug -- maybe you just didn't wait enough days?

~6 weeks had passed, filled with BRBs, all-nighters, working weekends, and testing tons of theories, all to no avail. On a whim I decided to revisit my lockless tracing strategy and remove a memory barrier. Alas! The bug triggered and I had tracing data!
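For readers who haven't written this kind of tracing: the comment doesn't show the actual code, so the following is only a minimal sketch of what barrier-free lockless tracing can look like, using C11 atomics. Every name here (`trace_ring`, `trace_record`, `TRACE_SLOTS`, the entry layout) is hypothetical and not taken from xnu or the drivers in question.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TRACE_SLOTS 1024u  /* power of two, so the index can be masked */

/* Hypothetical trace entry; a real one would likely carry a timestamp too. */
struct trace_entry {
    uint32_t event;  /* made-up event ID */
    uint32_t arg;    /* one word of context */
};

static struct trace_entry trace_ring[TRACE_SLOTS];
static atomic_uint trace_next;  /* next slot to claim, wraps via the mask */

/*
 * Claim a slot with a relaxed fetch-add and fill it in. No lock is taken,
 * and the relaxed ordering asks for no barrier, so the tracing perturbs
 * the ordering of the surrounding code as little as possible.
 */
static void trace_record(uint32_t event, uint32_t arg)
{
    unsigned slot = atomic_fetch_add_explicit(&trace_next, 1u,
                                              memory_order_relaxed);
    struct trace_entry *e = &trace_ring[slot & (TRACE_SLOTS - 1u)];
    e->event = event;
    e->arg = arg;
}
```

The ring is meant to be read after the fact (say, from a panic core), so concurrent writers that wrap around may overwrite each other; that trade-off is assumed to be acceptable here. Asking for a stronger ordering on the fetch-add, or adding an explicit fence, would reintroduce the kind of barrier that hid the bug.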

Digging into the tracing data, it turned out the problem wasn't in our drivers at all, but was actually in the kernel (IOKit) itself, IOInterruptController specifically. The problem was that IOIC was setting a flag and then immediately enabling interrupts via an MMIO write. With this logic, it was possible for another core to service an interrupt (since they were just enabled via the MMIO write), but still observe the old value of the flag, because there was no barrier between setting the flag and enabling interrupts. (Hence why the barrier added by my original tracing suppressed the bug.) Because IOIC read the wrong flag value, it entered a state that prevented interrupts from being serviced, and our buffer would fill and we'd panic. The fix was to simply add a memory barrier to IOIC between setting the flag and enabling interrupts.
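The shape of that race and its fix can be sketched in portable C11. This is not the xnu code: `vector_ready`, `fake_enable_reg`, and the three functions are hypothetical stand-ins, and a plain `volatile` variable plays the part of the MMIO register. In a real kernel the barrier would be a platform primitive that also orders device writes (on ARM, presumably something in the DSB family); `atomic_thread_fence` only stands in for it here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from xnu. */
static atomic_bool vector_ready;           /* the flag the handler checks */
static volatile uint32_t fake_enable_reg;  /* pretend interrupt-enable register */

/*
 * Buggy ordering: nothing forces the flag store to become visible to other
 * cores before the enable does, so a handler on another core can run and
 * still see the old flag value.
 */
static void enable_vector_buggy(void)
{
    atomic_store_explicit(&vector_ready, true, memory_order_relaxed);
    fake_enable_reg = 1u;
}

/* Fixed ordering: a barrier sits between the flag store and the enable. */
static void enable_vector_fixed(void)
{
    atomic_store_explicit(&vector_ready, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* stand-in for a HW barrier */
    fake_enable_reg = 1u;
}

/*
 * What another core might run when the interrupt fires.
 * Returns true if the interrupt would be serviced.
 */
static bool handle_interrupt(void)
{
    if (fake_enable_reg == 0u)
        return false;  /* interrupts not enabled yet */
    return atomic_load_explicit(&vector_ready, memory_order_acquire);
}
```

Run on one thread, both variants behave identically, which is part of why this kind of bug survives code review: the difference only shows up when a second core observes the two writes in the opposite order.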

To this day I'm still mystified as to why this bug hadn't caused broken interrupts (+ mysterious behavior) or mass panics before then. There must've been some other change to xnu that exposed the bug somehow, but I'll probably never know.

4 comments

rkangel 1556 days ago

> To this day I'm still mystified as to why this bug hadn't caused broken interrupts (+ mysterious behavior) or mass panics before then. There must've been some other change to xnu that exposed the bug somehow, but I'll probably never know.

My favourite species of software bugs are the ones that when you find the source you realise the code is fundamentally broken, and you get to investigate how the hell it worked for so long!

fps-hero 1556 days ago

Thank you for sharing, gosh that must have been horrific to track down.

There really is something about USB buses that causes the worst kinds of errors. I happen to know that the USB2 driver in the RPI 1/2/3/4 has a Linux kernel corruption bug which is completely masked by the use of a USB hub.

Why does it matter? Because the RPI 1, 2, 3 all hide the USB port behind a hub. Only the Zero and the A series have a naked port. Now, try searching the RPI forum for USB problems and start to notice a correlation.

The problem is that the hive mind decided that all USB errors must be power related, and given the complete dodginess of most RPI Zero setups it was always assumed this was the culprit.

Unfortunately it isn’t. No amount of probing, decoupling, or external powering ever fixed the glitches. Ah, but yes, not using an official 3A RPI-branded PSU was definitely the issue. Sorry Agilent, your PSUs just aren’t up to the task, probably the reason you had to “rebrand” in the first place. :S

We ended up retrofitting a USB hub in-line with a USB connector for a prototype, and we’ve since designed in the hub just for that one USB port used for USB data storage. Which is brilliant at the moment, because USB hub ICs are unobtainium, so we can’t make any more product because of this software bug.

Every couple of months I would try the latest kernel, but all you needed to do was write to disk continuously and you would hit the bug in 4 hours max, 10 minutes on average. The best part is that the kernel corruption kills the file system. We got a trace on a monitor (normally it's a headless system), but if you ssh’d in, you were dropped into an empty file system and couldn’t run any tool to diagnose the problem; a simple tab completion would hang the shell. Fun times.

I never bothered reporting the bug because I found hundreds of threads on the forums detailing similar issues. It is very uncool that to this day they still insist on using their bespoke driver instead of trying to mainline their performance fixes. Otherwise everyone using the dwc2 IP would have benefited, and this bug would have been fixed with hundreds more eyeballs on the problem, not just the one USB guy at RPI towers.

akino_germany 1556 days ago

Avoiding high-throughput devices is even in their list of known USB issues, so they must have some idea: https://www.raspberrypi.com/documentation/computers/raspberr...

saagarjha 1557 days ago

No apologies needed, this is why people visit Hacker News :)

mturmon 1556 days ago

Great anecdote, thanks for sharing your dissertation ;)