by pbsd
532 days ago
Interleaving CQO and SHLX results in a throughput of ~1.33 per cycle with the anomaly, ~2.0 without. This ratio is more or less constant whether the interleaving is 1:1, 2:2, 4:4 or 8:8 (with 1:1 it's slightly lower at ~1.28). This may or may not be consistent with each CQO uop executing in one cycle as expected and each SHLX uop taking up a port slot for 2 cycles (stalling for one cycle?), resulting in a runtime for x instructions of (x/2 * 1 + x/2 * 2)/2 = 0.75x ~ x/1.33 cycles across the 2 ports.
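
To make that arithmetic easy to check, here is a minimal sketch of the port-occupancy model. It assumes CQO and SHLX compete for the same 2 ports and that nothing else limits throughput; `modeled_throughput` and its parameters are names made up for illustration, not anything measured or taken from a tool.

```python
from fractions import Fraction

def modeled_throughput(mix, ports=2):
    """Uops per cycle under a simple port-occupancy model.

    mix   -- list of (count, port_cycles) pairs, one pair per uop type,
             where port_cycles is how long one uop blocks its port
    ports -- number of ports shared by all uops (assumed to be 2 here)
    """
    total_uops = sum(count for count, _ in mix)
    busy_cycles = sum(count * occupancy for count, occupancy in mix)
    # The busy cycles are assumed to spread evenly over the ports.
    runtime = Fraction(busy_cycles, ports)
    return Fraction(total_uops) / runtime

# 1:1 CQO/SHLX with the anomaly: CQO blocks a port for 1 cycle, SHLX for 2.
with_anomaly = modeled_throughput([(1, 1), (1, 2)])
# Same mix without the anomaly: both uops block a port for 1 cycle.
without_anomaly = modeled_throughput([(1, 1), (1, 1)])
print(float(with_anomaly), float(without_anomaly))
```

Because the model only depends on the ratio of the two uop types, it gives the same answer for 1:1, 2:2, 4:4 and 8:8, so it cannot explain the slightly lower ~1.28 measured at 1:1.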
|
So maybe it's like you say: SHLX with the anomaly is only semi-pipelined, so it takes 2 cycles before the next SHLX can dispatch to the same port. Perhaps it needs 1 cycle to handle the unfolding of the folded immediates, then the shift happens in the second cycle, and then the latency just has to be rounded up to 3 since all uops on those ports have a latency of 1 or 3, never 2 (which simplifies the scheduler, I believe).
That would also explain the original 1/cycle throughput for pure SHLX with the anomaly: each SHLX occupies its port for 2 non-pipelined cycles, and with 2 ports that gives the observed 1 per cycle.
So it's sort of like a 2 uop instruction but can't _actually_ be 2 uops because "rename" is too late to crack something into 2 uops (that's already happened), so it just does the doubling up internally?
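
The 1/cycle figure for pure SHLX can be reproduced with a toy dispatch loop. This is only a sketch under the assumptions in this thread (2 ports, each anomalous SHLX blocks its port for 2 cycles); `simulate_dispatch` is a hypothetical helper, and it ignores front-end width, dependencies and everything else a real core does.

```python
def simulate_dispatch(n_uops, occupancy, ports=2):
    """Cycles needed to dispatch n_uops identical, independent uops.

    Each uop goes to whichever port frees up first and then blocks
    that port for `occupancy` cycles (2 for the assumed anomalous SHLX).
    """
    free_at = [0] * ports  # first cycle at which each port can accept a uop
    for _ in range(n_uops):
        port = min(range(ports), key=lambda i: free_at[i])
        free_at[port] += occupancy
    return max(free_at)

n = 1000
for label, occupancy in (("with anomaly", 2), ("without anomaly", 1)):
    cycles = simulate_dispatch(n, occupancy)
    print(label, n / cycles, "SHLX per cycle")
```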