by BeeOnRope
524 days ago
Really interesting. Normal uops don't work like that: they are always pipelined, so a p06 op with 3-cycle latency would always be 3/0.5 (latency / reciprocal throughput), not 3/1. So the throughput of 1 strikes me as a renamer limit, not an execution limit, i.e., these instructions can only flow through the renamer at 1 per cycle. This could perhaps be tested by interleaving unrelated p06 ops in the same ratio as SHLX and seeing whether they and SHLX together are able to saturate p06. It would also be worth trying different interleaving granularities like 1:1 and 5:5, since I would expect those to behave differently in the renamer (but not much differently in execution).
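To make the interleaving idea concrete, here is a rough sketch of a generator for the two loop bodies (1:1 and 5:5). It only emits assembly text; the register choices, the use of `cqo` as the "unrelated p06 op", and the function name `interleaved_body` are all my assumptions for illustration, and the output would still need to be wrapped in a loop and timed with whatever benchmark harness you use.

```python
def interleaved_body(block, repeats, shlx="shlx r8, r9, rcx", filler="cqo"):
    """Build an unrolled loop body as a list of assembly lines.

    Each repeat emits `block` SHLX instructions followed by `block`
    filler instructions, so the overall SHLX:filler ratio is always 1:1
    and only the interleaving granularity changes.

    Assumptions (not from the thread): `cqo` stands in for the unrelated
    p06 op, and the SHLX operands are chosen so that back-to-back copies
    have no true dependency on each other.
    """
    lines = []
    for _ in range(repeats):
        lines += [shlx] * block
        lines += [filler] * block
    return lines


# Same instruction mix (5 SHLX + 5 filler), two granularities.
fine = interleaved_body(block=1, repeats=5)    # S F S F S F ...
coarse = interleaved_body(block=5, repeats=1)  # S S S S S F F F F F

if __name__ == "__main__":
    print("; 1:1 interleaving")
    print("\n".join(fine))
    print("; 5:5 interleaving")
    print("\n".join(coarse))
```

If the limit is in the renamer, I would expect the two bodies to time differently even though the port pressure is identical; if it is an execution limit, they should come out about the same.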
|
This may or may not be consistent with one CQO uop being executed once a cycle as expected, and one SHLX uop taking a spot (stalling for one cycle?) for 2 cycles, resulting in a runtime of (x/2 * 1 + x/2 * 2)/2 = 0.75x ~ x/1.33 cycles for x instructions.
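For anyone who wants to plug in other numbers, the arithmetic above can be written out as a small model. This is only the back-of-the-envelope estimate from this comment, not a simulation; `model_cycles` and its parameter names are made up for the sketch, and the two ports are assumed to be p0 and p6.

```python
def model_cycles(x, ports=2, cqo_slots=1, shlx_slots=2):
    """Estimate cycles for x instructions, half CQO and half SHLX.

    Each CQO is assumed to occupy a port for `cqo_slots` cycles and each
    SHLX for `shlx_slots` cycles, with the work spread evenly over
    `ports` ports. This mirrors (x/2 * 1 + x/2 * 2) / 2 from the text.
    """
    return (x / 2 * cqo_slots + x / 2 * shlx_slots) / ports


x = 1000
cycles = model_cycles(x)                   # SHLX holds its port for 2 cycles
ideal = model_cycles(x, shlx_slots=1)      # SHLX fully pipelined
ipc = x / cycles

if __name__ == "__main__":
    print(f"modelled: {cycles:.0f} cycles, {ipc:.2f} instructions/cycle")
    print(f"ideal:    {ideal:.0f} cycles, {x / ideal:.2f} instructions/cycle")
```

With the 2-cycle assumption the model gives 0.75x cycles (about 1.33 instructions per cycle), versus 0.5x cycles if both instructions were fully pipelined on the two ports.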