| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by brucehoult 494 days ago
	No, my argument is that even if load with scaled indexed addressing takes a cycle longer, it's a rare enough thing given a good compiler and, yes, in many cases vector/SIMD processing, that you are very unlikely to actually be able to measure a difference on a real-world program. I'll also note that only x86 can do base + scaled index + constant offset in one instruction. Arm needs two instructions, just like RISC-V.

1 comments

dzaima 494 days ago

My point with vectorization was that the one case where indexed loads/stores are most defendably unnecessary is also the case where you shouldn't want scalar mem ops in the first place. Thus meaning that many scalar mem ops would be outside of tight loops, and outside of tight loops is also where unrolling/strength reduction/LICM to reduce the need of indexed loads is least applicable.

Just ran a quick benchmark - seems Haswell handles "mov rbx, QWORD PTR [rbx+imm]" with 4c latency if there's no chain instructions (5c latency in all other cases, including indexed load without chain instrs, and "mov rbx, QWORD PTR [rbx+rcx*8+0x12345678]" always). So even with existing cases where the indexed load pushes it over to the next cycle, there are cases where the indexed load is free too.

link

brucehoult 494 days ago

And outside of tight loops is where a cycle here or there is irrelevant to the overall speed of the program. All the more so if you're going to have cache or TLB misses on those loads.

link

dzaima 494 days ago

I quite heavily disagree. Perhaps might apply to programs which do spend like 90% of their time in a couple tight loops, but there's tons of software that isn't that simple (especially web.. well, everything, but also compilers, video game logic, whatever bits of kernel logic happen in syscalls, etc), instead spending a ton of time whizzing around a massive mess. And you want that mess to run as fast as possible regardless of how much the mess being a mess makes low-level devs cry. If there's headroom in the AGU for a 64-bit adder, I'd imagine it's an extremely free good couple percent boost; though the cost of extra register port(s) (or logic of sharing some with an ALU) might be annoying.

And indexed loads aren't a "here or there", they're a pretty damn common thing; like, a ton more common than most instructions in Zbb/Zbc/Zbs.

link

brucehoult 494 days ago

This is not a discussion that can be resolved in the abstract. It requires actual experimentation and data and pointing at actual physical CPUs differing only in this respect and compare the silicon area, energy use, MHz achieved, and cycles per program.

link

dzaima 494 days ago

It's certainly not a thing to be resolved in the abstract, but it's also far from thing to be ignored as irrelevant in the abstract.

But I have a hard time imagining that my general point of "if there's headroom for a full 64-bit adder in the AGU, adding such is very cheap and can provide a couple percent boost in applicable programs" is far from true. Though the register file port requirement might make that less trivial as I'd like it to be.

link