by throwawaylinux 1557 days ago
> Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I doubt it!

Rubbish. These kernels (well, Linux and Windows) run on systems with hundreds or even thousands of cores, on CPUs which are very weakly ordered, with a pretty reasonable level of reliability. A race like this would blow up immediately.

Linux handles this by requiring that a context switch operation includes a full memory barrier. Switching off CPU0 has a barrier that orders the prior stores on CPU0 before the store to a field that implies the task can be migrated (it's not currently running). Switching on to CPU1 has a barrier that orders the load of that flag before subsequent loads from the task on CPU1.

EDIT: here - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

  * The basic program-order guarantee on SMP systems is that when a task [t]
  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
  * execution on its new CPU [c1].
It's informally worded, but "activity" basically means memory operations (though it could include wacky arch- and platform-specific things to cover all bases), and "happens-before" means observable from other CPUs, which is clear in context.
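
To make the ordering concrete, here is a rough userspace sketch of the same hand-off using C11 atomics and pthreads. It is only an analogy: `struct task`, `on_cpu`, `old_cpu`, `new_cpu` and `migrate_once` are names made up for this example, and it uses a release store paired with an acquire load rather than the kernel's own barrier primitives. Two threads stand in for CPU0 and CPU1.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a task; NOT the kernel's task_struct. */
struct task {
    int state;          /* plain data written while "running" on the old CPU */
    atomic_bool on_cpu; /* true while the task is still on the old CPU */
};

/* Plays the role of CPU0 switching the task out. */
static void *old_cpu(void *arg)
{
    struct task *t = arg;

    t->state = 42; /* the task's "activity" on the old CPU */

    /* Release store: every store above is ordered before the flag
     * that says "this task is no longer running here". */
    atomic_store_explicit(&t->on_cpu, false, memory_order_release);
    return NULL;
}

/* Plays the role of CPU1 picking the task up. */
static void *new_cpu(void *arg)
{
    struct task *t = arg;

    /* Acquire load: loads after the loop cannot be reordered before
     * the load that observed on_cpu == false. */
    while (atomic_load_explicit(&t->on_cpu, memory_order_acquire))
        ; /* spin until the old CPU has let go of the task */

    assert(t->state == 42); /* the old CPU's activity happens-before this */
    return (void *)(intptr_t)t->state;
}

/* Runs one simulated migration and returns what the new CPU observed. */
static int migrate_once(void)
{
    struct task t = { .state = 0 };
    pthread_t a, b;
    void *seen = NULL;

    atomic_init(&t.on_cpu, true);
    pthread_create(&b, NULL, new_cpu, &t);
    pthread_create(&a, NULL, old_cpu, &t);
    pthread_join(a, NULL);
    pthread_join(b, &seen);
    return (int)(intptr_t)seen;
}
```

Calling `migrate_once()` from a test harness should always give back the value written on the "old CPU". If both accesses to `on_cpu` were relaxed instead, the C11 memory model would treat the read of `state` as a data race, which is the kind of bug the kernel's guarantee is there to rule out.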
2 comments

Even if you're not following all the arguments involved, you can brighten your day by spending a few moments reading the documentation in this Linux code (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...), which is a great example of how to document complex code.
Note that this specific bug seems to happen only after cache operations (i.e. something like CLFLUSH or CLZERO in x86 parlance). It is possible that these instructions on the Switch SoC require a different barrier, either because of spec details or hardware bugs.