by gmueckl 3092 days ago
There does not need to be a performance hit, but cache complexity must rise: speculative execution must use a separate cache for any data that was fetched speculatively. Only when that branch is truly accepted must that data enter the "real" cache. As long as speculative execution does not go on for too long, these secondary caches can stay really tiny (a handful of cache lines maybe).
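To make the idea concrete, here is a minimal toy sketch in Python of a side buffer for speculatively fetched lines. The class name, the `side_capacity` parameter, and the stall-on-full behaviour are all assumptions for illustration, not a description of any real CPU.

```python
class SpeculativeCache:
    """Toy model: lines fetched under speculation live in a small side
    buffer and only enter the "real" cache once the branch is accepted."""

    def __init__(self, side_capacity=4):
        self.real = set()   # line addresses visible in the "real" cache
        self.side = []      # line addresses fetched speculatively, in fetch order
        self.side_capacity = side_capacity  # hypothetical size: a handful of lines

    def load(self, line, speculative):
        """Return True on a hit, False on a miss (which triggers a fill)."""
        if line in self.real or (speculative and line in self.side):
            return True
        if speculative:
            if len(self.side) >= self.side_capacity:
                # Assumption: a full side buffer simply stops further speculation.
                raise RuntimeError("side buffer full: speculation must stall")
            self.side.append(line)
        else:
            self.real.add(line)
        return False

    def commit(self):
        """Branch accepted: promote speculative lines into the real cache."""
        self.real.update(self.side)
        self.side.clear()

    def squash(self):
        """Misprediction: drop speculative lines so the real cache is unchanged."""
        self.side.clear()
```

After `squash()`, the real cache holds exactly what it held before speculation started, which is the property the proposal relies on.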
2 comments

The "speculation time" can be hundreds of cycles if you have a branch or memory read that takes a long time to resolve.

This problem is already solved for speculative writes to main memory: a speculative store buffer keeps a sequence of memory operations which need to be done when the operation retires. These buffers are very power hungry, because every future speculative read must check every entry in the speculative store buffer to see if it is re-reading a previously written address. That many-to-many mapping leads to an exponential amount of checking logic.
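A rough Python sketch of such a store buffer, to show where the checking cost comes from. Everything here (the `StoreBuffer` class, the dict standing in for memory, the `comparisons` counter) is a hypothetical toy, not how any particular core implements it.

```python
class StoreBuffer:
    """Toy speculative store buffer with store-to-load forwarding."""

    def __init__(self):
        self.entries = []      # (address, value) pairs in program order
        self.comparisons = 0   # address comparisons performed so far

    def store(self, address, value):
        # Speculative stores are buffered, not written to memory.
        self.entries.append((address, value))

    def load(self, address, memory):
        # Hardware compares the load address against every entry in parallel;
        # here we walk all entries and let the youngest match win.
        hit, result = False, None
        for addr, value in self.entries:
            self.comparisons += 1
            if addr == address:
                hit, result = True, value
        # Assumption: unwritten memory reads as 0.
        return result if hit else memory.get(address, 0)

    def retire(self, memory):
        # Speculation confirmed: drain buffered stores to memory in order.
        for addr, value in self.entries:
            memory[addr] = value
        self.entries.clear()

    def squash(self):
        # Misprediction: buffered stores vanish without touching memory.
        self.entries.clear()
```

In this toy model every load costs one comparison per buffered store, so the total work grows with the product of loads and buffered stores.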

The same could be done for cache reads/writes, but I have a feeling it would quickly get very complex, large, and power hungry.

Those hundreds of cycles of speculative execution can't include more than a handful of cache modifications though, because a change to the caching state implies a miss in the speculated execution itself. So you can't have more than a small number of those before the original stall is over and the misprediction resolved.
What you are describing is simply plain associative memory. If I remember correctly, this is complex in its implementation, but does not grow exponentially. Please correct me if I am wrong.
Fully associative memory is generally very power hungry.

That's why CPU caches are usually "2-way associative" or "4-way associative".

That means the data you're looking for might be in one of 2 (or 4) places. Fully associative means the data you're looking for might be in any memory slot, and you're gonna have to check them all. Checking them all in parallel is possible, so it isn't a speed issue, but it is a massive power issue. Average power use is the main limiting factor in CPUs today.
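Here is a small Python sketch of that difference, counting tag comparisons per lookup. The class, the 64-byte line size, and the LRU replacement policy are assumptions chosen for illustration; a fully associative cache is modelled as a single set whose number of ways equals the number of lines.

```python
class SetAssociativeCache:
    """Toy cache that counts how many tag comparators fire per access."""

    def __init__(self, num_lines, ways, line_size=64):
        if num_lines % ways:
            raise ValueError("num_lines must be a multiple of ways")
        self.ways = ways
        self.line_size = line_size          # assumed 64-byte lines
        self.num_sets = num_lines // ways
        self.sets = [[] for _ in range(self.num_sets)]  # tags per set, LRU first
        self.comparisons = 0

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = address // self.line_size
        index = line % self.num_sets        # which set the line maps to
        tag = line // self.num_sets
        tags = self.sets[index]
        # Every way of the selected set is compared in parallel in hardware.
        self.comparisons += self.ways
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)                # mark as most recently used
            return True
        if len(tags) == self.ways:
            tags.pop(0)                     # evict least recently used (assumed policy)
        tags.append(tag)
        return False


two_way = SetAssociativeCache(num_lines=64, ways=2)
fully = SetAssociativeCache(num_lines=64, ways=64)
for addr in range(0, 64 * 100, 64):
    two_way.access(addr)
    fully.access(addr)
```

For the same 100 accesses, the 2-way cache performs 2 comparisons per lookup while the fully associative one performs 64, which is the switching activity the power argument is about.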

In general in a CPU, transistors which stay in the same state don't use much power. Transistors changing state use power. In a fully associative memory, the transistors doing the comparing change state with every comparison. Whereas with a regular memory only the transistors for the individual bit of the memory being read or written change state and use power.

(the above is a simplification, but contains the key elements).

Associative memory is a huge matrix of AND gates in the comparator. But we are talking about buffering results of speculated reads after a predicted branch.

The density of load instructions in code is not particularly high on average. Also, all loads are subject to the same latencies, so the chance that a speculative read completes before the blocking one is also low (the data would have to be cached in a higher-level cache, I think).

Taken together, I would be surprised if more than about 10 speculative reads can successfully complete at all in that time frame, even though it is hundreds of cycles. So that would be around 1000 AND gates and 1000 memory cells. Doesn't sound too big to me.

Unclear. From the Spectre paper:

More broadly, potential countermeasures limited to the memory cache are likely to be insufficient, since there are other ways that speculative execution can leak information. For example, timing effects from memory bus contention, DRAM row address selection status, availability of virtual registers, ALU activity, and the state of the branch predictor itself need to be considered.

... also ...

Of course, speculative execution will also affect conventional side channels, such as power and EM

Historically I think it's been assumed that you can't extract much useful information from a modern speculating CPU via EM radiation, but these attacks constantly seem to be surprising people. Re-programming a wifi chip to monitor interference generated by the CPU to spy on speculation? It would have sounded like a pie in the sky fantasy ... yesterday.

I will be highly impressed if anyone manages to pull off a reliable and generic side channel attack based on the hardware you listed. Most of what you describe is squarely in the realm of tinfoil hattery.

DRAM and the memory bus are also affected by DMA operations running independently of the CPU.

Power consumption? There is no hardware available to measure that, let alone at the time resolution required. If you have to first attach a GHz bandwidth oscilloscope to the computer you might as well just reboot it or dump its RAM contents or whatever.

Forget about reprogramming a Wi-Fi chip. They operate on narrow channels in the 2.4GHz range and have fixed hardware for modulation. You would at least have to force the CPU to switch to the right frequency and then be lucky enough that it radiates a signal that demodulates to something sensible within the Wi-Fi hardware. This is physically impossible on current hardware.

Also, on a different note, we cannot sacrifice performance willy-nilly for the sake of a bit of potential security gains. A 30% performance loss on servers means that the counter move is to consume 30% more power to maintain current levels of operation in a data center. This energy needs to be generated, which means that someone is burning oil or gas for it, with all the consequences. In essence, the current patches will result in thousands or millions of extra tons of CO2 in the atmosphere. More efficient replacement hardware will eventually be produced, with extra environmental impact. We need to find ways to avoid that. Soon.