As I read through the meltdown paper, it looks really difficult to have the security we want and the performance we want at the same time. It's pretty crazy, but here's my limited understanding:
There's a huge shared buffer between two threads. 256 * 4K. One thread reads a byte of kernel memory, literally any byte it wants, and it then reads one of those 4K pages from that buffer in order to cache that one memory page that corresponds to the byte it just read. Then at some point the CPU determines that the thread shouldn't be permitted to access the kernel memory location, and rolls back all of that speculative execution, but the cached memory page isn't affected by the rollback.
The other thread iterates through those 256 pages, timing how long it takes to read from each page, and the one page that Thread A accessed will have a different (shorter?) timing because it's cached already. It now understands one byte of kernel memory that it shouldn't. That's just one byte but the whole process is so fast that it's easy to just go nuts on the whole kernel address space.
So what would the fixes be? Disable speculative execution? Only do it if the target memory location is within userspace, or within the same space as the executing address? Plug all of the sideband information leak mechanisms? I dunno.
Keep a small pool of cache lines exclusive to speculative execution, discard when non taken, rename affected cache lines (like register renaming so no copy) when taken.
In the simplest Meltdown case, the offending instruction is really executed and a General Protection Fault occurs. That is handled in the kernel which at that point could (simply?) flush all caches to remove the leaked information.
The real problem with Meltdown seems to occur when:
1) The offending instruction is NOT really executed because it is in a branch which is not actually taken.
2) The offending instruction is executed but within a transaction, which leads to an exception-free rollback (with leaked information left in cache though).
AFAIK neither is (or can be made) visible to the kernel (which could explain the very large PTI patch), but I do wonder if they are events that can be hanlded at the microcode level, in which case a microcode update from Intel could mitigate them.
The MELTDOWN one is the easy one (as is evident by the fact that this is the one that only seems to affect Intel CPUs).
When a load is found to be illegal, an exception flag is set so that if the instruction is retired (ie. the speculated execution is found to be the actual path taken), a page fault exception can be raised. To prevent MELTDOWN, at the same time that the flag is raised you can set the result of the load to zero.
SPECTRE is the really hard one to deal with. Part of the solution might be providing a way for software to flush the branch predictor state.
Maybe separate BTBs. Or maybe disable branch target prediction when in kernel mode (but then some VM process may still observe some other process running inside a different VM via a side channel).
Not allow user processes to recover from a SEGV. The attack depends on a signal hander that traps the signal and resumes execution. If this is disabled then the attack will not work. This would affect two types of systems:
1. Badly written code where bugs are being masked by the handler.
2. Any kind of virtualization?
So, for cloud providers it looks like a 30% performance hit, but for the rest of us I would rather have a patch that stops applications handling the SEGV trap.
The attacks do not rely on recovering from SIGSEGV. The speculated execution that accesses out-of-bounds or beyond privilege level happens in a branch that's predicted-taken but actually not-taken, so the exception never occurs.
Ah, ok - then I read the paper wrongly. i’ll go back and have another look.
Edit: yes, I missed the details in section 4.1 when I skimmed through. I’m not familiar with the Kocker paper, but I assume the training looks like this?
I can imagine some ways to armor the branch predictor, similar in principle to how languages like Perl have to include a random seed in their hash code (in some circumstances) to avoid being able to pre-compute values that will all hash to the same thing [1]. There should be some ways to relatively cheaply periodically inject such a randomization into the prediction system enough to prevent that aspect of the attack. This will cost something but probably not be noticeable to even the most performance-sensitive consumers.
But no solution leaps to mind for the problem of preventing speculative code from leaking things via cache, short of entirely preventing speculating code from being able to load things into the cache. If nobody can come up with a solution for that, that's going to cost us something to close that side channel. Not sure what though, without a really thorough profiling run.
And I'd put my metaphorical $5 down on someone finding another side channel from the speculative code; interactions with changing processor flags in a speculative execution, or interaction with some forgotten feature [2] where the speculation ends up incorrectly undone or something.
It's going to be really hard to give up real world gains from branch prediction. Branch prediction can make a lot of real world (read "not the finest code in the world") run at reasonable speeds. Another common pattern to give up would be eliding (branch predicting away) nil reference checks.
> short of entirely preventing speculating code from being able to load things into the cache
Some new server processors allow us to partition cache (to prevent noisy neighbors) [1,2]. I don't have experience working with this technology but everything I read makes me believe this mechanism can works on a per process basis.
If that kind of complexity is already possible in CPU cache hierarchy I wonder if it's possible to implement per process cache encryption. New processors (EPYC) can already use different encryption keys for each VM, so it might be a matter of time till this is extended further.
It's possible to key the cache in the kernel on CPL so at least there should be no user / kernel space scooping of cache lines.
It's possible we can never fully prevent all attacks in same address space. So certain types of applications (JIT and sandboxes) might forever be a cat and mouse game since we're unlikely to give up on branch prediction.
AFAICT injecting any sort of delay that prevents this attack would also completely negate any benefit from caches and that would take us back to 2000s performance at best, even with 10-16 core Xeon monsters. The branch predictor is really just a glorified cache prefetcher so you'd not only have to harden the branch predictor but anything that could possibly access the cache lines that the branch predictor has pulled up.
"The branch predictor is really just a glorified cache prefetcher so you'd not only have to harden the branch predictor..."
I was just thinking of the part they were talking about where it was too predictable, not the rest of the issues. Instead of a single hard-coded algorithm we could switch to something that has a random key element, like XOR'ing a rotating key instead of a hard-coded one, similar to some of the super-basic hashing some predictors already do. Prefetching I just don't know what to do with. I mentally started down the path of considering what it would take for the CPU to pretend the page was never cached in the first place on a misprediction, but yeow, that got complicated fast, between cache coherency issues between processors and all of the other crap going on there, plus the fact that there's just no time when we're talking about CPU and L1 interactions.
Timing attacks really blow. Despite the "boil the ocean" nature of what I'm about to say, I find myself wondering if we aren't better served by developing Rust and other stronger things to the point that even if the system is still vulnerable to timing attacks it's so strong everywhere else that it's a manageable problem. Maybe tack on some heuristics to try to deal with obvious hack attempts and at least raise the bar a bit. More process isolation (as in other links mtanski gives you can at least confine this to a proces). (What if Erlang processes could truly be OS processes as far as the CPU was concerned?) I'm not saying that is anything even remotely resembling easy... I'm saying that it might still be easier than trying to prevent timing attacks like this. That's a statement about the difficulty of fixing timing attacks in general, not the easy of "hey, everybody, what if you just wrote code better?", a "solution" I often make fun of myself.
There's a huge shared buffer between two threads. 256 * 4K. One thread reads a byte of kernel memory, literally any byte it wants, and it then reads one of those 4K pages from that buffer in order to cache that one memory page that corresponds to the byte it just read. Then at some point the CPU determines that the thread shouldn't be permitted to access the kernel memory location, and rolls back all of that speculative execution, but the cached memory page isn't affected by the rollback.
The other thread iterates through those 256 pages, timing how long it takes to read from each page, and the one page that Thread A accessed will have a different (shorter?) timing because it's cached already. It now understands one byte of kernel memory that it shouldn't. That's just one byte but the whole process is so fast that it's easy to just go nuts on the whole kernel address space.
So what would the fixes be? Disable speculative execution? Only do it if the target memory location is within userspace, or within the same space as the executing address? Plug all of the sideband information leak mechanisms? I dunno.