Hacker News new | ask | show | jobs
by broodbucket 1557 days ago
>Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I doubt it!

Context switches, idle state transitions, etc tend to be fairly delicately handled as a common cause of CVEs and Heisenbugs. I'm sure there's still plenty of bugs but more attention ends up being paid to these things on general purpose operating systems. More eyeballs on the code, more security researchers, more hardware variants to expose things that were thought to be fine. Also fuzzing.

6 comments

For the record: Returning from an interrupt is fully serializing on x86, which I believe means that all modern operating systems on x86 will handle this properly.

https://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-r... is a very good blog post about exploiting this for a high-performance membarrier daemon.

But a thread can be context switched via a system call, so switching back to it will return to it via a return from system call which is not necessarily serializing on x86. Also memory operations and barriers executed in kernel mode have to be ordered correctly within a thread and context switches there can happen cooperatively so there may be no interrupt at all.

Also a serializing operation is not a memory barrier. It serializes execution of operations in the pipeline, not necessarily coherency operations after completion.

x86 is mostly TSO except possibly some cases of nontemporal stores and write combining memory types. I don't know the minutiae of the ISA and implementations any more but IIRC it could be possible that stores in a a write combining buffer can be visible out of order.

> But a thread can be context switched via a system call, so switching back to it will return to it via a return from system call which is not necessarily serializing on x86.

System call entry is under the control of the thread though. If your sequence of operations fails due to a missing barrier and being rescheduled after a system call, it might be a kernel bug, but you can fix it by putting in the barrier and it would likely be fixed on the switch via game patches and maybe a quality control check that the troublesome sequence is not present in the code.

Linux has barriers in the scheduler for this and other reasons.
Right, but it's possible that this is just a side effect of x86 doing TSO, so any stores are visible just by executing the instruction, so I'm not convinced (at least by that snippet) that interrupts have to be fully serializing. If they're not, I wonder if it's possible to observe load-store reordering across a privilege boundary…
I'm not sure what you mean; it is documented in the Intel instruction set reference that IRET is a serializing instruction. It is not only the case that it happens to appear that way due to stores by chance always being retired before the interrupt returns.

It also references Section 8.3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, which also says IRET is a serializing instruction.

Sorry, I forgot to quote the part that I wanted to pick on:

> Interrupt processing (returning from an interrupt handler, actually) is fully serialising on x86, and on other platforms, no doubt: any userspace instruction either fully executes before the interrupt, or is (re-)executed from scratch some time after the return back to userspace.

Approaching this from the perspective of not knowing that iret is serializing, this quote doesn't really provide a good reason for why it should be serializing. You've mentioned the manual, which is probably what should've been quoted, because the thing I extracted out goes on to say that it must be true on other platforms too when it evidently is not. The rest of my comment was a hypothetical on how you might observe if iret was non-serializing, which (given what the manual says) must be unobservable.

FWIW: I re-read this a bunch of times, and I don't understand how this isn't a hardware bug. How can an asynchronous interrupt be specified in any rigorous way if it does NOT act as a memory barrier to the interrupted code? Clearly the CPU isn't going to cache its in-flight state for every interrupt (and remember interrupts can be themselves interrupted!). So certainly "most" of its state is being serialized. And we're supposed to magically guess on a per-IP basis which state isn't? Yikes.

I mean, the fix is the same. But arguing about which OSes "handle this properly" is missing the point. The question to ask is which core IPs (and which configurations thereof, remember the Tegra in question has both A53 and A57 cores) require barriers on interrupt entry, and under what circumstances. If ARM isn't going to publish that errata then asking for OS authors to magically figure it out is just asking for bugs.

Knowing the Tegra chip in question, I'd bet it's probably not ARM's fault. Tegra X1, unlike pretty much every other SoC, did big.LITTLE via cluster migration with a custom cache coherence system, instead of just having a bunch of heterogenous cores. It turns out that their custom cache coherence was unfixably broken and would randomly corrupt memory when doing migration between the big and little cores, so everyone was forced to just entirely disable one set of cores. At least NVIDIA managed to fix this one in software?
You are most likely right. I would expect that barriers would be needed unconditionally. This patch is likely working around some hole in the SoC where the usual barrier is not sufficient for some instructions and some additional ritual need to be performed.
> FWIW: I re-read this a bunch of times, and I don't understand how this isn't a hardware bug. How can an asynchronous interrupt be specified in any rigorous way if it does NOT act as a memory barrier to the interrupted code? Clearly the CPU isn't going to cache its in-flight state for every interrupt (and remember interrupts can be themselves interrupted!). So certainly "most" of its state is being serialized. And we're supposed to magically guess on a per-IP basis which state isn't? Yikes.

Memory barriers only concern interactions with other agents that access memory.

Any given thread of execution is always consistent with respect to itself, including when taking interrupts.

"Serialized" is also not the same as a barrier and is not really related to memory consistency. Serialization does only matter within a single thread of execution.

Most OS vendors will just put a full barrier there and be done with it. Switch doesn't for some reason (and still doesn't, for some reason…not entirely sure why, maybe their context switch is ridiculously fast and the extra barrier would be very expensive?) but when you don't do that you run into problems like these. What's harder is processors storing microarchitectural state that you don't know about, but (post Spectre and Meltdown) CPU vendors have realized that this is not a great idea and started fixing it in microcode and/or hardware.
It makes sense to me you might need a memory flush when doing a core migration. Everything is still coherent from a single core perspective. But if you point a different core at the same PC, well, maybe it sees things differently that haven't been flushed.
ARM caches are (normally) coherent, though. That's not the issue here. It's something about the instruction reordering playing badly with the cache flush and interrupt entry hardware. And my point was that's a hardware errata and not a software bug.

And there seems to be no link to docs from ARM describing the issue, which is IMHO kinda horrifying.

I don't see anything about instruction reordering? Where'd you get that from?

And I don't really see how the interaction with interrupt entry could be the problem, since the code works just fine if the code in the interrupt leaves the thread on the same core.

> ARM caches are (normally) coherent, though.

Don't you need a memory barrier to get that coherency, though?

Memory barrier instructions on ARM exist to force ordering of memory operations as seen by external hardware (generally software on other cores). They obviously interact with the cache at a hardware level, but they're different layers of abstraction.

And the reason those memory operations might be seen in an order different from their appearance in the machine code is precisely the fact that the processor executes them in parallel and potentially out of order. On x86, the hardware does magic (in almost all cases) to prevent this artifact. But ARM puts the responsibility on the programmer.

But all that stuff is specified (even if it's hard to reason about). What's happening here is extra-specification, something about that cache invalidate and barrier interacts in a way that an interrupt can mess up. But we don't know what it is, because it seems like ARM didn't tell anyone.

Basically: as I see it, any OS author writing interrupt entry code on ARM64 (I work on Zephyr, though not on the ARM port) needs to put a barrier instruction on the entry path for safety, because at least some hardware misbehaves without it. But that said, almost all real OSes are going to have one anyway for locking purposes (i.e. you have to take a spinlock to interact with OS state somewhere, and htat requires a barrier on SMP ARM systems). It's likely that this Nintendo sequence is part of some kind of micro-optimized thing and not a general purpose ISR.

On what basis are you saying the interrupt messes it up?

The post directly says that if a migration doesn't happen, then nothing goes wrong.

What messes up is when you do the barrier-needing instructions on one core, and a memory barrier on a completely different core. Which seems pretty expected to me.

If the thing you're describing happens, that does sound like a hardware bug, but I don't see where you got that description from.

> I don't understand how this isn't a hardware bug

From what I understand, if the chip’s documentation would say “all interrupt handlers must start with a memory barrier”, this would be a software bug.

Isn’t it the case that a hardware bug for which a workaround is documented before shipping is ‘just’ a misfeature? (In this case, supporting user-supplied interrupt handlers would be a bit complicated. When it gets installed, you’d have to check their first instruction after first making its memory page non-writable by user code)

Back to the workaround: they seem confident that this only is a problem when doing “user-mode cache operations (flush / clean / zero)”, and those, apparently, can all be fixed to set that TLS flag. If I were trying to break into this system, I would look at both assumptions.

In particular, can you clear that secret byte directly after the kernel set it, and get the old behavior back? Worse, does “user-mode cache operations” imply those are completely run in user mode (since they can make this fix, presumably using a library provided by Nintendo)? If so, what prevents you from using your own cache flush code that doesn’t set the flag?

I might be missing something but don't see your point.

If you want every interrupt to act as a memory barrier you can just insert a memory barrier in the interrupt handler. A reason not to do this is the overhead. Also, if you know the interrupt handler will not interact with memory from the thing it interrupted or migrate the task it interrupted between cores, it isn't necessary to have a memory barrier.

> and which configurations thereof, remember the Tegra in question has both A53 and A57 cores

Only the A57 cores are enabled at all, perhaps due to a hardware limitation. Consider the A53 missing for all intent and purposes.

Will Deacon replied on Twitter how at least Linux handles this correctly [1][2] and I assume that the same is true for XNU and Windows as well.

[1] https://twitter.com/WillDeacon/status/1506375874161086471

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

Cache coherency on SoCs is one of the more hairy aspects, and one that's getting an increasing amount of attention from the software world.

Certainly for Mac it's a well-known topic, though I won't necessarily say it's all safe. Like I said: it's hairy.

(fun fact: when you have 10s-100s of millions of units out there, those "1 in a million" chances become all too frequent...)

Cache coherency for SoCs is usually proven with formal verification. It's a very good target for such techniques because the needed properties are straightforward to write down.

I'd be very surprised to see any respectable firm shipping a multicore chip where formal verification has not been done on the cache protocol.

Note that this isn't perfect; if a needed property was not written down and was not checked, errors can be missed. But people have been doing this for at least 15 years now, and there's academic work that is older.

Edit: this doesn't mean everyone does it right, there's at least one example in the comments about someone shipping buggy cache coherence.

x86 Darwin had a bug where AVX512 K (opmask) registers would not be saved and restored on context switches when YMM registers had all 0 values. The kernel assumed that AVX512 wasn't used. They fixed it in Monterey 12.2, but still hasn't backported. Heisenbug for sure.

Ref: https://github.com/golang/go/issues/49233#issuecomment-96373...

It's unfortunate that more architectures aren't implemented with something like DEC Alpha's PALCode, where the CPU's firmware is essentially a single-tenant hypervisor. The OS kernel proper uses upcalls to the firmware for manipulating page tables, etc. If abstracted properly, the OS would allocate the storage space, but ask the firmware to serialize and deserialize processor state on context switches.

In retrospect, the Alpha went a bit too far with the permissiveness of its memory model, and it turned out they really did need single-byte load and store instructions. However, it was really an elegant high-level design, and the implementation team was top-notch (same folks worked on StrongArm, AMD Athlon, P.A. Semi's PWRficient, and Apple silicon).

I think it helps if the right tools exist in the first place.