by brucehoult
494 days ago
No, my argument is that even if a load with scaled indexed addressing takes a cycle longer, it is rare enough, given a good compiler and, in many cases, vector/SIMD processing, that you are very unlikely to be able to measure a difference on a real-world program. I'll also note that only x86 can do base + scaled index + constant offset in one instruction. Arm needs two instructions, just like RISC-V.
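To make the addressing-mode point concrete, here is a minimal C sketch of a load whose address is base + index*8 + 16. The function names are made up for illustration, and the instruction sequences in the comments are what I would expect a compiler to emit, not verified output. The RISC-V line assumes the Zba extension is available; without it the shift and add are separate instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: load the 64-bit element two slots past base[index].
 * Effective address = base + index*8 + 16.
 *
 * Expected (not verified) code shapes:
 *   x86-64:        mov rax, QWORD PTR [rdi+rsi*8+16]     ; one instruction
 *   AArch64:       add x8, x0, x1, lsl #3
 *                  ldr x0, [x8, #16]                     ; two instructions
 *   RISC-V (Zba):  sh3add a0, a1, a0
 *                  ld     a0, 16(a0)                     ; two instructions
 */
int64_t load_one_step(const int64_t *base, size_t index) {
    return base[index + 2];
}

/* The same load written as the two steps Arm and RISC-V would use:
 * first form base + index*8, then load at a constant offset of 16 bytes. */
int64_t load_two_step(const int64_t *base, size_t index) {
    const int64_t *scaled = base + index; /* base + index*8 */
    return scaled[2];                     /* load at +16 bytes */
}
```

Both functions return the same value for any in-bounds index; the only difference is how the address arithmetic is split across instructions.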
|
Just ran a quick benchmark. It seems Haswell handles "mov rbx, QWORD PTR [rbx+imm]" with 4c latency if there are no chain instructions. Latency is 5c in all other cases, including an indexed load without chain instructions, and always for "mov rbx, QWORD PTR [rbx+rcx*8+0x12345678]". So although there are existing cases where the indexed load pushes it over to the next cycle, there are also cases where the indexed load is free.
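For anyone who wants to try something similar, a load-to-load latency test of this kind is usually a pointer chase. The sketch below is only an assumption about the general shape, not the benchmark actually run above: the struct layout and function names are hypothetical, timing code is left out, and the compiler is free to pick a different instruction than the "[reg+imm]" form.

```c
#include <stddef.h>

/* Hypothetical node layout: 'next' sits at a nonzero offset so that the
 * dependent load naturally has the form [reg + imm]. */
struct node {
    long pad;
    struct node *next;
};

/* Link n nodes into a ring so the chase can run for any number of steps. */
void make_ring(struct node *nodes, size_t n) {
    for (size_t i = 0; i < n; i++) {
        nodes[i].pad = (long)i;
        nodes[i].next = &nodes[(i + 1) % n];
    }
}

/* Follow 'steps' links. Each load's address depends on the result of the
 * previous load, so with the ring resident in L1 the run time is roughly
 * steps times the load-to-use latency. */
struct node *chase(struct node *p, size_t steps) {
    while (steps--) {
        p = p->next;
    }
    return p;
}
```

Timing `chase` over a large step count and dividing by the number of steps gives an estimate of cycles per dependent load. Inserting extra dependent work between the loads would be one way to approximate the "chain instructions" case.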