by tavianator 527 days ago
I don't think there are a thousand constant registers. I think the renamer just represents (reg, 10-bit offset) pairs rather than just registers.

Also the problem affects SHL as well as SHLX, I didn't realize until just now.


The speculation of a "reg + 10-bit offset" representation feels wrong to me.

That requires a whole bunch of extra 64-bit full-adders everywhere one of these pairs might be consumed (so realistically, on every single read port of the register file). 64-bit adders take quite a bit of latency, so you don't want extra adders on all your critical paths.

In the case where it appears to be holding a reg + offset pair, what I think has actually happened is that the renamer (and/or uop fusion) has rewritten the uop to a 3-input add, with the offset as the third input.

> Also the problem affects SHL as well as SHLX, I didn't realize until just now.

And presumably SHR/SHRX/SAR/SARX too?

You don't quite need a full 64-bit adder to materialize the proper value, as one argument is only a 10-bit int. So a 10-bit adder, a 54-bit increment, and muxing it with the original depending on the add carry.
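To make the "10-bit adder, 54-bit increment, mux" idea concrete, here is a minimal software sketch. It only models the arithmetic, not real hardware, and `add_small_offset` is a hypothetical name. It assumes an unsigned 10-bit offset as described above; a signed offset would presumably also need a decrement path for the upper bits.

```python
MASK64 = (1 << 64) - 1
LOW_BITS = 10
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << (64 - LOW_BITS)) - 1  # upper 54 bits


def add_small_offset(value: int, offset: int) -> int:
    """Add an unsigned 10-bit offset to a 64-bit value without a 64-bit adder."""
    assert 0 <= value <= MASK64
    assert 0 <= offset <= LOW_MASK

    # 10-bit adder: the result can be up to 11 bits, the top bit is the carry.
    low_sum = (value & LOW_MASK) + offset
    carry = low_sum >> LOW_BITS

    # 54-bit incrementer on the upper bits (wraps like 64-bit arithmetic).
    high = value >> LOW_BITS
    high_plus_one = (high + 1) & HIGH_MASK

    # Mux: pick the incremented upper half only if the low add carried out.
    high_selected = high_plus_one if carry else high

    return (high_selected << LOW_BITS) | (low_sum & LOW_MASK)
```

For any in-range inputs this should agree with `(value + offset) & MASK64`, which is the point of the comment: the upper 54 bits never need a general adder.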

And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?

Right. Actually it turns out it's 11 bits, since [-1024, 1023] are all supported by the immediate add renamer.

In general I think people are overstating the delay of an additional 64-bit add on register file reads (though I'm not a hardware guy so maybe someone can correct me). There are carry-lookahead adders with log_2(n) == 6 gate delays. Carry-save adders might also be relevant to how they can do multiple dependent adds with 0c latency.
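As a rough illustration of where the log_2(n) figure comes from, here is a sketch of a Kogge-Stone style parallel-prefix adder, one common carry-lookahead design, modelled on Python integers. The function name is made up for this example, and nothing here says which adder design any real core uses. Note that it counts prefix levels; each level is more than one gate deep, and there is setup and a final XOR on top.

```python
def kogge_stone_add(a: int, b: int, width: int = 64):
    """Return (sum mod 2**width, number of prefix levels used)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    generate = a & b    # bit i produces a carry on its own
    propagate = a ^ b   # bit i passes an incoming carry along
    half_sum = propagate

    levels = 0
    shift = 1
    while shift < width:
        # Combine each bit's group with the group `shift` positions below it.
        # `generate` is updated first so that it uses the old `propagate`.
        generate = (generate | (propagate & (generate << shift))) & mask
        propagate = (propagate & (propagate << shift)) & mask
        shift <<= 1
        levels += 1

    # After the last level, bit i of `generate` is the carry out of bit i,
    # so the carry into bit i+1 is `generate << 1`.
    total = (half_sum ^ (generate << 1)) & mask
    return total, levels
```

With `width=64` the loop runs for shifts 1, 2, 4, 8, 16, 32, so six levels.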

> And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?

No, the perf counters show 1 uop dispatched/retired in both the slow and fast cases.

Ah, good to know on the uop count. Still could be (or, well, has to be to some extent) the same concept, just pipelined within one uop.

I dunno, you could imagine it happens [speculatively?] in parallel, at the cost of a read port and adder for each op that can be renamed in a single cycle:

1. Start PRF reads at dispatch/rename

2. In the next cycle, you have the result and compute the 64-bit result. At the same time, the scheduler is sending other operands [not subject to the optimization] to read ports.

3. In the next cycle, the results from both sets of operands are available

Seems rather pointless to implement it like that. You would save space in the scheduler because the uops have been fused, but execution time and execution unit usage are the same as just executing the original ops before this optimisation was implemented.

It also adds an extra cycle of scheduling latency, which may or may not be an issue (I really have no idea how far ahead these schedulers can schedule).

If you think about the "1024 constant registers" approach: It allows you to have a small 10-bit adder in the renamer which handles long chains of mov/add/sub/inc/dec ops, as long as the chain stays in the range of 1024. This frees the main adders for bigger sums, or maybe you can power gate them.

And in the case where one of the inputs is larger than 1024, or its value is unknown during renaming, the renamer can still merge two two-input adds into a single three-input add uop.
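A toy model of what is being speculated about in this thread, purely as a sketch: the renamer maps each architectural register to a (physical register, offset) pair, folds small immediate adds into the offset without emitting a uop, and falls back to a real uop when the offset leaves [-1024, 1023] or when two registers are added. `ToyRenamer`, the uop tuples, and the `p0`, `p1`, ... names are all invented for illustration; this is not a description of how any actual core works.

```python
OFFSET_MIN, OFFSET_MAX = -1024, 1023  # range reported upthread


class ToyRenamer:
    def __init__(self):
        self.mapping = {}   # architectural reg -> (physical reg, offset)
        self.uops = []      # uops that actually go to the execution units
        self._next_phys = 0

    def _alloc(self):
        name = f"p{self._next_phys}"
        self._next_phys += 1
        return name

    def define(self, reg):
        """reg receives a value the renamer knows nothing about."""
        self.mapping[reg] = (self._alloc(), 0)

    def mov(self, dst, src):
        # Move elimination: dst simply shares src's pair.
        self.mapping[dst] = self.mapping[src]

    def add_imm(self, dst, imm):
        base, offset = self.mapping[dst]
        new_offset = offset + imm
        if OFFSET_MIN <= new_offset <= OFFSET_MAX:
            # Folded in the renamer: no uop is emitted.
            self.mapping[dst] = (base, new_offset)
        else:
            # Out of range: materialize base + new_offset with a real uop.
            phys = self._alloc()
            self.uops.append(("add_imm", phys, base, new_offset))
            self.mapping[dst] = (phys, 0)

    def add_reg(self, dst, src):
        (base1, off1), (base2, off2) = self.mapping[dst], self.mapping[src]
        # Both tracked offsets ride along as the third input of one uop.
        phys = self._alloc()
        self.uops.append(("add3", phys, base1, base2, off1 + off2))
        self.mapping[dst] = (phys, 0)
```

In this model a chain of a hundred `add_imm(reg, 1)` calls emits no uops at all, while the consumer that finally needs the value pays for a single add with the accumulated offset attached.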

An add3 operation would have 3 inputs, though. Do we have other integer operations with 3 64-bit inputs?

Yeah, I am pretty sure that the renamer adds the tracked immediate(s) to the emitted uop, but that's not inconsistent with tracking "reg + offset pairs", is it?