| My understanding: Delay slots only make sense when you don't have a branch predictor. I get the impression that many people massively underestimate how powerful branch prediction is. Even a simple 2-bit saturating counter will correctly predict about 90% of branches, and the accuracy only goes up with better designs. So why optimise for the uncommon case of incorrectly predicted branches? In the worst case, static branch delay slots actually hurt a machine with branch prediction, because instead of executing the correctly predicted instructions, it's executing the delay slot, which is often a nop because the compiler couldn't find something to put there. With your other ideas (multiple decoders, explicit delay slots) it's just a question of whether it's a good use of resources (design time, transistors, compiler support) to support this uncommon case, or whether you might be better off optimising something else like the branch predictor so more code goes down the common path, or just improving the pipeline's throughput in general. |
Agreed.
I'll go even further: lots of other RISC ideas only really matter when you don't have a branch predictor (and your transistor and design budgets are tiny).
Fixed-length instructions, for example. Even x86 instructions aren't much of a problem when you can afford to throw more pipeline stages at decoding them, which you can with a branch predictor.
I don't think anybody in the mid-80s understood how good branch predictors can be. Most people probably still didn't really understand it in the early 90s.
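
For anyone who hasn't played with one, here is a rough sketch of the 2-bit saturating counter mentioned in the quote, run against a toy branch pattern. Everything in it is my own assumption for illustration: the names `simulate` and `loop_branch`, the choice of "weakly taken" as the starting state, and the synthetic workload (a loop-closing branch that is taken 9 times and then falls through). It is not a measurement of real code, just a way to see why such a small predictor does so well on loop-heavy programs.

```python
def simulate(outcomes, initial_state=2):
    """Count correct predictions made by a single 2-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    The starting state is an assumption (2 = weakly taken).
    """
    state = initial_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            state = min(state + 1, 3)
        else:
            state = max(state - 1, 0)
    return correct


def loop_branch(iterations, trips):
    """Hypothetical workload: a loop-closing branch that is taken on every
    iteration except the last, with the whole loop entered `trips` times."""
    return ([True] * (iterations - 1) + [False]) * trips


if __name__ == "__main__":
    outcomes = loop_branch(iterations=10, trips=100)
    correct = simulate(outcomes)
    print(f"{correct}/{len(outcomes)} = {correct / len(outcomes):.1%}")
    # prints "900/1000 = 90.0%"
```

The only miss per trip is the loop exit. Because the counter needs two wrong outcomes in a row to flip its prediction, that single fall-through doesn't cause a second miss when the loop is entered again, which is the whole point of using 2 bits instead of 1.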