Hacker News new | ask | show | jobs
by londons_explore 3091 days ago
The "speculation time" can be hundreds of cycles if you have a branch or memory read that takes a long time to resolve.

This problem is already solved with speculative writes to main memory - a speculative store buffer keeps a sequence of memory operations which need to be done when the operation retires. These buffers are very power hungry, because every future speculative read must check every entry in the speculative store buffer to see if it is re-reading a previously written address. That many to many mapping leads to an exponential amount of checking logic.

The same could be done for cache reads/writes, but I have a feeling it would quickly get very complex, large, and power hungry.

2 comments

Those hundreds of cycles of speculative execution can't include more than a handful of cache modifications though, because a change to the caching state implies a miss in the speculated execution itself. So you can't have more than a small number of those before the original stall is over and the misprediction resolved.
What you are describing is sinply plain associative memory. If I remember correctly, this is complex in its imolementation, but does not grow exponentially. Plesse correct me if I am wrong.
fully associative memory is generally very power hungry.

Thats why in CPU's caches are usually "2 way associative" or "4 way associative".

That means the data you're looking for might be in one of 2 (or 4) places. Fully associative means the data you're looking for might be in any memory slot, and you're gonna have to check them all. Checking them all in parallel is possible, so it isn't a speed issue, but it is a massive power issue. Average power use is the main limiting factor in CPU's today.

In general in a CPU, transistors which stay in the same state don't use much power. Transistors changing state use power. In a fully associative memory, the transistors doing the comparing change state with every comparison. Whereas with a regular memory only the transistors for the individual bit of the memory being read or written change state and use power.

(the above is a simplification, but contains the key elements).

Associative memory is a huge matrix of and gates in the comparator. But we are talking about buffering results of speculated reads after a predicted branch.

The density if load instructions in code is not particularly high on average. Also, all loads are subject to the same latencies, so that the chance that a speculative read completes before the blocking one is also low (must be cached in a higher level cache, I think).

Taken together, I would be surprised if more than about 10 speculative reads can successfully complete at all in that time frame, even though it is hundreds of cycles. So that would be around 1000 and gates and 1000 memory cells. Doesn't sound too big to me.