by solarexplorer 1486 days ago
We already have these architectures in the embedded space and as DSPs. I suppose they would be interesting for supercomputers as well, but for general-purpose CPUs they would be a difficult sell. Since the memory size and latency would be part of the ISA, binaries could not run unchanged on different memory configurations; you would need another software layer to take care of that. Context switching and memory mapping would also need some rethinking. Of course, all of this can be solved, but it would make adoption more difficult.

Last but not least, unknown memory latency is not the only source of problems; branch (mis)predictions are another. They have the same remedies as cache misses: multithreading and speculative execution.

So if you wanted to get rid of branch prediction as well, you could come up with something like the CRAY-1.

1 comment

You are right that a kind of multi-threading can be useful to mitigate the effects of branch mispredictions.

However, for this, fine-grained multi-threading is enough. Simultaneous multi-threading does not bring any advantage, because the thread with the mispredicted branch cannot progress.

Out-of-order execution cannot be used during branch mispredictions, so, as I have said, both SMT and OoOE are techniques that are useful only when a data cache memory exists.

Any CPU with pipelined instruction execution needs a branch predictor, and it needs to speculatively execute the instructions on the predicted path in order to avoid the pipeline stalls caused by control dependencies between instructions. An instruction cache memory is also always needed for a CPU with pipelined instruction execution, to ensure that the instruction fetch rate is high enough.
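To put rough numbers on the stall argument, here is a back-of-the-envelope sketch in Python. It is a toy model, not a description of any real CPU: it assumes a single-issue pipeline, and the function names, the instruction and branch counts, and the 10-cycle branch resolve delay are all made up for illustration.

```python
def cycles_stalling_on_every_branch(n_instr, n_branches, resolve_delay):
    """No predictor: the pipeline waits for every branch to resolve."""
    return n_instr + n_branches * resolve_delay


def cycles_with_predictor(n_instr, n_mispredicted, resolve_delay):
    """With a predictor: only mispredicted branches pay the resolve delay."""
    return n_instr + n_mispredicted * resolve_delay


# Hypothetical workload: 1000 instructions, 200 of them branches,
# 10 cycles to resolve a branch, 10 of the 200 branches mispredicted.
stalled = cycles_stalling_on_every_branch(1000, 200, 10)
predicted = cycles_with_predictor(1000, 10, 10)
print(stalled, predicted)
```

Under these assumed numbers, stalling on every branch costs 3000 cycles against 1100 with prediction, which is the gap that speculative execution on the predicted path is there to close.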

Unlike simultaneous multi-threading, fine-grained multi-threading is useful in a CPU without a data cache memory, not only because it can hide the latencies of branch mispredictions, but also because it can hide the latencies of any long operations, as is done in all GPUs.
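The latency-hiding effect can be sketched with a small Python simulation. This is a minimal toy model under my own assumptions, not how any particular CPU or GPU is built: a single-issue core picks the next ready thread in round-robin order each cycle, every instruction has the same fixed latency, and a thread cannot issue again until its previous instruction has finished. The name `simulate` and its parameters are hypothetical.

```python
def simulate(num_threads, latency, instrs_per_thread):
    """Return (cycles, utilization) for a toy fine-grained multithreaded core.

    cycles counts up to and including the cycle of the last issue;
    utilization is the fraction of those cycles in which an instruction issued.
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    total = num_threads * instrs_per_thread
    cycle = 0
    issued = 0
    next_thread = 0
    while issued < total:
        # Scan the threads in round-robin order and issue from the first ready one.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break
        cycle += 1  # if no thread was ready, this cycle is an idle bubble
    return cycle, issued / cycle


for n in (1, 2, 4):
    cycles, util = simulate(num_threads=n, latency=4, instrs_per_thread=100)
    print(n, cycles, round(util, 3))
```

In this model, with a latency of 4 cycles, one thread keeps the issue slot busy only about a quarter of the time, while four threads are enough to issue on every cycle. No cache and no out-of-order logic are involved, only enough independent threads to cover the latency.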

Fine-grained multi-threading is significantly simpler to implement than simultaneous multi-threading.