by veltas 1557 days ago
> And how do you even find / debug a bug like this?

As someone who has worked on cache code, I suspect it's quite possible they were just reviewing this code again and realised the potential hole. Or they were trying to track down some horrific bug and fixed this along the way (whether or not it caused it). Reviewing anything to do with caching is probably worth doing because it's notoriously difficult to get right, especially with context switching involved.

Another possibility is that the bug is more deterministic than it looks, under the right conditions, and they managed to replicate it and analyse it in a debugger.


In Computer Science there are only two hard problems: Cache Invalidation, Naming Things and Off By One Errors.
Even low-probability bugs will surface often enough if you give them enough chances to do so. There are >100 million Switches out there, and the interrupts happen at least tens to hundreds of times a second when in use, so plenty of opportunities :)
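To put rough numbers on that, here's a back-of-the-envelope sketch. The device count and interrupt rate are the low ends of the figures above; the hours of use and the per-interrupt probability are made-up values purely for illustration, not anything known about the actual bug.

```python
# Back-of-the-envelope estimate. Only the device count and interrupt rate
# come from the comment above; the other two values are assumptions.
DEVICES = 100_000_000          # ">100 million" Switches (lower bound)
INTERRUPTS_PER_SECOND = 10     # low end of "tens to hundreds"
HOURS_IN_USE_PER_DAY = 1       # hypothetical average usage per device
P_BUG_PER_INTERRUPT = 1e-12    # hypothetical chance one interrupt hits the window


def expected_hits_per_day(devices, rate_hz, hours, p):
    """Expected number of times the bug fires across all devices per day."""
    interrupts = devices * rate_hz * hours * 3600
    return interrupts * p


hits = expected_hits_per_day(
    DEVICES, INTERRUPTS_PER_SECOND, HOURS_IN_USE_PER_DAY, P_BUG_PER_INTERRUPT
)
print(f"{hits:.1f} expected hits per day across all devices")
```

With those assumed values, a one-in-a-trillion chance per interrupt still works out to a few hits per day somewhere in the world.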
Yep, but can they reproduce it? When we say "low probability" we're acting like it's truly random, but in reality they could have stumbled across steps that reproduce it very frequently.
You don't necessarily need to reproduce this one: once noticed, you can work through on paper where the timing problems could lie.

Noticing the issue in the first place is the big problem: as TFA says, the side effects of this are likely graphical glitches that would not stop any shows and would just get marked down as a hardware timing issue¹. Once noticed by someone with the skills to notice it while looking at the code for other reasons, it is obvious².

Proving the fix fully resolves the matter without adding other problems could be a detailed task, perhaps with multiple skilled people passing their eye over the result to try to ensure there isn't another “oh, interesting” moment³ waiting to happen, but again that doesn't necessarily mean needing to reliably simulate the exact situation.

[1] pick up another couple of devices, yep confirmed, it doesn't happen on these

[2] well, obvious to that someone with that skill, I'll not claim it is as obvious to such as myself!

[3] like one that may have been how this was found, assuming it was spotted in passing while working on that area for other reasons

Sometimes you can figure out the bug without reliably reproducing it if you have enough logs/stack traces etc.
A lot of bugs in the embedded world get fixed just by code inspection; I think most people who have done systems or embedded coding have casually noted bugs just by going through some code looking for something else or adding a new path or feature.