Hacker News new | ask | show | jobs
by static_noise 3091 days ago
Only if your speculative reads do cause irreversible side-effects on those caches. You could implement them in a way that doesn't modify the caches... but that would be complicated and probably use more power and have lower performance.
2 comments

One of the main reasons for speculative execution is to fetch data into the caches ahead of them being needed. If you don't modify the cache, then you throw that away.

May be one way would be to use a smaller, separate cache for speculative execution and then copy that value to the regular cache once speculation is confirmed? This would add a one cycle latency for cache-to-cache transfer but there might be better ways.

This might actually improve performance because it would prevent actually-hot data being evicted from the cache in favour of cold data that was loaded in a not-taken speculated branch.
no need for cache-to-cache transfer, just like with invisible registers we could have cache line renaming
It is not enough. The cache line still need be evicted (or at least marked as no longer exclusive) from other cpus, so the side effect is still visible.
There does not need to be a performance hit, but cache complexity must rise: speculative execution must use a separate cache for any data that was fetched speculatively. Only when that branch is truly accepted must that data enter the "real" cache. As long as speculative execution does not go on for too long, these secondary caches can stay really tiny (a handful of cache lines maybe).
The "speculation time" can be hundreds of cycles if you have a branch or memory read that takes a long time to resolve.

This problem is already solved with speculative writes to main memory - a speculative store buffer keeps a sequence of memory operations which need to be done when the operation retires. These buffers are very power hungry, because every future speculative read must check every entry in the speculative store buffer to see if it is re-reading a previously written address. That many to many mapping leads to an exponential amount of checking logic.

The same could be done for cache reads/writes, but I have a feeling it would quickly get very complex, large, and power hungry.

Those hundreds of cycles of speculative execution can't include more than a handful of cache modifications though, because a change to the caching state implies a miss in the speculated execution itself. So you can't have more than a small number of those before the original stall is over and the misprediction resolved.
What you are describing is sinply plain associative memory. If I remember correctly, this is complex in its imolementation, but does not grow exponentially. Plesse correct me if I am wrong.
fully associative memory is generally very power hungry.

Thats why in CPU's caches are usually "2 way associative" or "4 way associative".

That means the data you're looking for might be in one of 2 (or 4) places. Fully associative means the data you're looking for might be in any memory slot, and you're gonna have to check them all. Checking them all in parallel is possible, so it isn't a speed issue, but it is a massive power issue. Average power use is the main limiting factor in CPU's today.

In general in a CPU, transistors which stay in the same state don't use much power. Transistors changing state use power. In a fully associative memory, the transistors doing the comparing change state with every comparison. Whereas with a regular memory only the transistors for the individual bit of the memory being read or written change state and use power.

(the above is a simplification, but contains the key elements).

Associative memory is a huge matrix of and gates in the comparator. But we are talking about buffering results of speculated reads after a predicted branch.

The density if load instructions in code is not particularly high on average. Also, all loads are subject to the same latencies, so that the chance that a speculative read completes before the blocking one is also low (must be cached in a higher level cache, I think).

Taken together, I would be surprised if more than about 10 speculative reads can successfully complete at all in that time frame, even though it is hundreds of cycles. So that would be around 1000 and gates and 1000 memory cells. Doesn't sound too big to me.

Unclear. From the Spectre paper:

More broadly, potential counter- measures limited to the memory cache are likely to be insufficient, since there are other ways that speculative execution can leak information. For example, timing ef- fects from memory bus contention, DRAM row address selection status, availability of virtual registers, ALU ac- tivity, and the state of the branch predictor itself need to be considered.

... also ...

Of course, speculative execution will also affect conventional side channels, such as power and EM

Historically I think it's been assumed that you can't extract much useful information from a modern speculating CPU via EM radiation, but these attacks constantly seem to be surprising people. Re-programming a wifi chip to monitor interference generated by the CPU to spy on speculation? It would have sounded like a pie in the sky fantasy ... yesterday.

I will be highly impressed if anyone managed to pull of a reliable and generic side channel attack based on the hardware you listed. Most of what you describe is squarely in the realm of tinfoil hattery.

DRAM and the memory bus is also affected by DMA operations running independently of the CPU.

Power consumption? There is no hardware available to measure that, let alone at the time resolution required. If you have to first attach a GHz bandwidth oscilloscope to the computer you might as well just reboot it or dump its RAM contents or whatever.

Forget about reprogramming a Wi-Fi chip. They operate on narrow channels in the 2.4GHz range and have fixed hardware for modulation. You would at least have to force the CPU to switch to the right frequency and then be lucky enough that it radiates a signal that demodulated to something sensible within the Wi-Fi hardware. This is physically impossible on current hardware.

Also, on a different note, we cannot sacrifice performance willy-nilly for the sake of a bit of potential security gains. A 30% performance loss on servers means that the counter move is to consume 30% more power to maintain current levels of operation in a data center. This energy needs to be generated, which means that someone is burning oil or gas for it with all the consequences. In essence, the current patches will result in an extra thousands or millions of tons of CO2 in the atmosphere. More efficient replacement hardware will eventually produced with extra environmental impact. We need to find ways to avoid that. Soon.