Hacker News new | ask | show | jobs
by brucehoult 495 days ago
Ok, there exist cores that don't have a penalty for scaled indexed addressing (though many do). Or is it that they don't have any benefit from non-indexed addressing? Do they simply take a clock speed hit?

But that is all missing the point of "true but irrelevant".

You can't just compare the speed of an isolated scaled indexed load/store. No one runs software that consists only, or even mostly, of isolated scaled indexed load/store.

You need to show that there is a measurable and significant effect on overall execution speed of the whole program to justify the extra hardware of jamming all of that into one instruction.

A good start would be to modify the compiler for your x86 or Arm to not use those instructions and see if you can detect the difference on SPEC or your favourite real-world workload -- the same experiment that Cocke conducted on IBM 370 and Patterson conducted on VAX.

But even that won't catch the possibility that a RISC-V CPU might need slightly more clock cycles but the processor is enough simpler that it can clock slightly higher. Or enough smaller that you can use less energy or put more cores in the same area of silicon.

And as I said, in the cases where the speed actually matters it's probably in a loop and strength-reduced anyway.

It's so lazy and easy to say that for every single operation faster is better, but many operations are not common enough to matter.

1 comments

So your argument isn't that it's irrelevant, but rather that it might be irrelevant, if you happen to have a core where the extra latency of a 64-bit adder on the load/store AGU pushes it just over to the next cycle.

Though I'd imagine that just having the extra cycle conditionally for indexed load/store instrs would still be better than having a whole extra instruction take up decode/ROB/ALU resources (and the respective power cost), or the mess that comes with instruction fusion.

And with RISC-V already requiring a 12-bit adder for loads/stores, thus and an increment/decrement for the top 52 bits, the extra latency of going to a full 64-bit adder is presumably quite a bit less than a full separate 64-bit adder. (and if the mandatory 64+12-bit adder already pushed the latency up by a cycle, a separate shNadd will result in two cycles of latency over the hypothetical adderless case, despite 1 cycle clearly being feasible!)

Even if the RISC-V way might be fine for tight loops, most code isn't such. And ideally most tight loops doing consecutive loads would vectorize anyway.

We're in a world where the latest Intel cores can do small immediate adds at rename, usually materializing them in consuming instructions, which I'd imagine is quite a bit of overhead for not that much benefit.

No, my argument is that even if load with scaled indexed addressing takes a cycle longer, it's a rare enough thing given a good compiler and, yes, in many cases vector/SIMD processing, that you are very unlikely to actually be able to measure a difference on a real-world program.

I'll also note that only x86 can do base + scaled index + constant offset in one instruction. Arm needs two instructions, just like RISC-V.

My point with vectorization was that the one case where indexed loads/stores are most defendably unnecessary is also the case where you shouldn't want scalar mem ops in the first place. Thus meaning that many scalar mem ops would be outside of tight loops, and outside of tight loops is also where unrolling/strength reduction/LICM to reduce the need of indexed loads is least applicable.

Just ran a quick benchmark - seems Haswell handles "mov rbx, QWORD PTR [rbx+imm]" with 4c latency if there's no chain instructions (5c latency in all other cases, including indexed load without chain instrs, and "mov rbx, QWORD PTR [rbx+rcx*8+0x12345678]" always). So even with existing cases where the indexed load pushes it over to the next cycle, there are cases where the indexed load is free too.

And outside of tight loops is where a cycle here or there is irrelevant to the overall speed of the program. All the more so if you're going to have cache or TLB misses on those loads.
I quite heavily disagree. Perhaps might apply to programs which do spend like 90% of their time in a couple tight loops, but there's tons of software that isn't that simple (especially web.. well, everything, but also compilers, video game logic, whatever bits of kernel logic happen in syscalls, etc), instead spending a ton of time whizzing around a massive mess. And you want that mess to run as fast as possible regardless of how much the mess being a mess makes low-level devs cry. If there's headroom in the AGU for a 64-bit adder, I'd imagine it's an extremely free good couple percent boost; though the cost of extra register port(s) (or logic of sharing some with an ALU) might be annoying.

And indexed loads aren't a "here or there", they're a pretty damn common thing; like, a ton more common than most instructions in Zbb/Zbc/Zbs.

This is not a discussion that can be resolved in the abstract. It requires actual experimentation and data and pointing at actual physical CPUs differing only in this respect and compare the silicon area, energy use, MHz achieved, and cycles per program.