by murdoze 1165 days ago
We discovered a critical bug in the QNX 6 kernel in a networking scenario. There was no workaround, since the bug was in the core of their message-passing infrastructure - a kernel call that is non-blocking by design, SendPulse(), sometimes blocked.

It took me 9 months of talking to them about this problem before I managed to reproduce it on just two nodes with half a page of code, and to record kernel logs that clearly showed a race condition.

We received a patched kernel within a few days, and we ran with that for a while. The fix was merged into the official release almost two years later.

After that - only Linux, where we can see and fix stuff. No proprietary code and bureaucracy, no "fast, robust and reliable" operating systems.

2 comments

There is no safety-certified Linux. As far as I know there was no safety-certified QNX 6 either (QOS 1.0 was based on QNX 6.5 SP1, which is not the same as QNX 6 despite the numbers looking eerily similar).

With a safety-certified system, you do not receive a patch, because applying it would invalidate the safety certification. Of course, you can get a patch and use it, but then you're responsible for safety-certifying the entire stack, including the closed-source vendor code, and best of luck.

Without this patch QNX was simply not usable, since it had a random race condition in the kernel, which just broke everything.

They incorporated the patch into some QNX 6.5 version, but we had to deliver the product long before they did.

This is exactly my point. Also, even if there is a workaround, more often than not the complexity of the mountain of workarounds just creates the next set of certified bugs.

It's a valid point, but the solution is not obvious. It's a trade-off in a big design space. (Of course with software it seems "trivial" to make sure the certification can be done quickly and cheaply. Just automate it! Unfortunately we're not there yet. :/ )

See also this comment: https://news.ycombinator.com/item?id=35589690