| HN Mirror



	by BeeOnRope 3060 days ago
	It doesn't decode to a single fused uop, but rather 2 fused pairs of uops (so 4 total unfused uops). So there is fusion going on, twice (the load and ALU op are fused, and the two store uops are fused). If you use the three-instruction sequence, the load and ALU op can't fuse, which potentially makes it slower (but not in this case, since the bottleneck is elsewhere).


timmclean 3060 days ago

I believe you're thinking about `add`. According to Agner Fog's instruction tables, the load and ALU uops are fused for `add`, but not for `inc`.

http://www.agner.org/optimize/instruction_tables.pdf


BeeOnRope 3060 days ago

Yes, weird - it was pointed out elsewhere here, and I updated some of my comments but not this one.

It's quite unusual that add gets the 2-uop treatment but inc doesn't. Yes, they treat flags differently, but that's mostly been resolved through flag renaming, and the reg forms of inc don't suffer any penalty.

I'll have to double check if this is true. If it is, compilers should generally be preferring add [mem], 1 then (except perhaps when optimizing for size) - the difference in the flag behavior is pretty much never relevant for compiled code.


pbsd 3060 days ago

Renaming is unrelated to my guess about the flags. The point is that there's a limit to how many inputs a fused uop can have, 3, and the flags register may become one input too many to be able to fuse the uops. For example,

    inc [rdi+rbx]

has the obvious rdi and rbx dependencies, the flags register, plus (presumably, depending on implementation details) an allocated virtual register for the 1 that is added. On the register forms this limit is never a bottleneck.

You also see the same behavior, according to Agner, on SHR/SHL m, i, which may or may not alter some flags depending on shift amounts, and strangely on NOT m, which explicitly does not alter the flags in any situation. This latter one makes little sense.


BeeOnRope 3060 days ago

Sure, but everything you say about inc is true of add as well, yet add double-fuses fine (by "double-fuse" I mean it is 2/4 uops in the fused/unfused domains, unlike inc, which is 3/4). In general many RMW instructions (double) fuse and most (all?) also modify the flags.

I doubt there is a virtual register for the 1 really - sure there is some storage for it somewhere in the ROB or the scheduler or whatever, but it doesn't need to be "renamed" in the usual sense since it's not visible anywhere. In any case, the add case is "worse" since it can have a real architectural register there, not just an implied immediate 1.

Yes, there is definitely a limit on the number of inputs a uop can have - and you can see this in the effect of "unlamination", which is where a uop fuses in the uop cache, but then unfuses after that and so mostly acts like an unfused uop (except for uop cache space). This shows up with indexed addressing modes.

For example:

    add [rax], 1

fully double-fuses, but:

    add [rax + rbx], 1

Double-fuses only in the uop cache (counts as 2 there), but unlaminates once after that (counts as 3 in the rest of the fused domain).

Interestingly though this guy:

    add [rax], rbx

Still fully double-fuses everywhere, despite having the same number of input registers as the add [rax + rbx], 1 case. Probably it's easier for the renamer though, because the registers are spread across the uops more evenly rather than being concentrated in the load uop?
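The three cases above can be tabulated in a small Python sketch. The numbers are only the fused-domain counts quoted in this comment, not measurements, and the names `FUSED_UOPS` and `unlaminates` are made up for illustration:

```python
# Fused-domain uop counts as quoted in this thread (not measured here).
# Each value is (uops in the uop cache, uops in the rest of the fused domain).
FUSED_UOPS = {
    "add [rax], 1": (2, 2),        # fully double-fuses
    "add [rax + rbx], 1": (2, 3),  # indexed addressing: fused only in the uop cache
    "add [rax], rbx": (2, 2),      # register source: still double-fuses everywhere
}


def unlaminates(instr):
    """True if the instruction takes more fused-domain uops after the uop cache."""
    in_cache, after_cache = FUSED_UOPS[instr]
    return after_cache > in_cache


for instr, (in_cache, after_cache) in FUSED_UOPS.items():
    note = "unlaminates" if unlaminates(instr) else "stays fused"
    print(f"{instr:<20} cache={in_cache} later={after_cache}  {note}")
```

Only the indexed form is reported as unlaminating, which is what makes the `add [rax], rbx` case stand out.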

Moving away from RMW to load-op, there are other indications flags aren't a problem: things like BMI shrx/shlx/rorx with a memory operand don't fuse, even though these don't update flags at all. On the other hand, ANDN, which is similarly in BMI, is also a 3-arg instruction (distinct src/dest), and updates flags, does fuse! So actually I'd say updating the flags in a consistent way makes it more likely to fuse.

Maybe that's the answer then?

Anything that updates the flags in the "standard way" - i.e., SF,ZF,CF,OF all set to something, can (potentially) micro-fuse. Anything which doesn't - whether that is updating fewer flags (inc) or no flags (shrx) or updating them "weirdly" (shl and friends) isn't eligible. Interesting theory and still consistent in broad strokes with your "it's the flags!" claim.


pbsd 3060 days ago

This theory is cool, but I don't think it works, all things considered. PDEP and PEXT should have the same unfused behavior as SHLX, since they also do not change any flags, but they _do_ fuse. BEXTR should (or could) fuse, but doesn't. So I don't know.
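As a rough way to see where the theory breaks, here is a minimal Python sketch that checks the "standard flags" rule against the fusion behaviour claimed in this thread. The flag categories are a simplification of the descriptions above, the names `OBSERVATIONS` and `theory_predicts_fusion` are hypothetical, and BEXTR is left out because the thread does not say how it updates flags:

```python
# (instruction, flag behaviour as described in the thread, fuses per the thread)
OBSERVATIONS = [
    ("add", "standard", True),
    ("andn", "standard", True),
    ("inc", "partial", False),    # updates fewer flags
    ("shl", "irregular", False),  # flags depend on the shift amount
    ("shrx", "none", False),
    ("shlx", "none", False),
    ("rorx", "none", False),
    ("pdep", "none", True),
    ("pext", "none", True),
]


def theory_predicts_fusion(flag_behaviour):
    """The proposed rule: only 'standard' flag updates are eligible to micro-fuse."""
    return flag_behaviour == "standard"


counterexamples = [
    name
    for name, flags, fuses in OBSERVATIONS
    if theory_predicts_fusion(flags) != fuses
]
print(counterexamples)  # ['pdep', 'pext']
```

With only these entries, PDEP and PEXT are the instructions the rule gets wrong.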


BeeOnRope 3059 days ago

You are right, so yeah, I can't really explain why certain ops fuse and some don't. There doesn't seem to be a strong pattern.
