by clamchowder
496 days ago
"don't run any faster than a sequence of simpler instructions" This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's Athlon to 10H family has 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some. For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need a shift and a dependent add. That's two extra cycles of latency.
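To make the shift-plus-dependent-add point concrete, here is a rough C sketch of the two forms of the same access. The function names are made up for illustration, and an optimizing compiler is free to turn both into the same machine code on a target that has scaled indexed addressing, so read the second function as a picture of the dependency chain rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* One scaled indexed access: on x86-64 or aarch64 a compiler can
   typically emit a single load whose address is base + index * 8. */
int64_t load_indexed(const int64_t *base, size_t i) {
    return base[i];
}

/* The same access spelled out as the "simple" sequence.
   Each step depends on the result of the one before it. */
int64_t load_explicit(const int64_t *base, size_t i) {
    uintptr_t offset = (uintptr_t)i << 3;       /* shift: i * sizeof(int64_t) */
    uintptr_t addr = (uintptr_t)base + offset;  /* add: needs the shift result */
    return *(const int64_t *)addr;              /* load: needs the add result */
}
```

Both functions return the same value for the same arguments; the only difference is whether the address arithmetic is folded into the load or done as separate dependent steps.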
But that is all missing the point of "true but irrelevant".
You can't just compare the speed of an isolated scaled indexed load/store. No one runs software that consists only, or even mostly, of isolated scaled indexed load/store.
You need to show that there is a measurable and significant effect on overall execution speed of the whole program to justify the extra hardware of jamming all of that into one instruction.
A good start would be to modify the compiler for your x86 or Arm to not use those instructions and see if you can detect the difference on SPEC or your favourite real-world workload -- the same experiment that Cocke conducted on IBM 370 and Patterson conducted on VAX.
But even that won't catch the possibility that a RISC-V CPU might need slightly more clock cycles but the processor is simple enough that it can clock slightly higher. Or small enough that it uses less energy, or that you can put more cores in the same area of silicon.
And as I said, in the cases where the speed actually matters it's probably in a loop and strength-reduced anyway.
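For anyone unfamiliar with strength reduction, a minimal C sketch of what it does to such a loop follows. The function names are hypothetical, and in practice the compiler usually performs this transformation itself; the second function just writes the result out by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* As written in the source: each iteration addresses a[i], which is
   conceptually base + i * 8. */
int64_t sum_indexed(const int64_t *a, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After strength reduction: the multiply/shift per iteration is replaced
   by bumping a pointer, so every load uses a plain register address. */
int64_t sum_strength_reduced(const int64_t *a, size_t n) {
    int64_t sum = 0;
    const int64_t *end = a + n;
    for (const int64_t *p = a; p != end; p++)
        sum += *p;
    return sum;
}
```

In the second form there is no scaled index left in the loop body, so the hot path needs no indexed addressing mode at all.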
It's so lazy and easy to say that for every single operation faster is better, but many operations are not common enough to matter.