Really interesting. Normal uops don't work like that, they are always pipelined, so a p06 op with 3-cycle latency would always be 3/0.5, not 3/1.
So the 1-throughput strikes me as a renamer limit, not an execution limit. I.e., these instructions can only flow through the renamer at 1 per cycle.
This could perhaps be tested by interleaving unrelated p06 ops in the same ratio as SHLX, and see if they + SHLX are able to saturate p06, and also trying different interleaving granularities like 1:1 and 5:5 since I would expect those to behave differently in the renamer (but not much different in execution).
Interleaving CQO and SHLX results in ~1.33 throughput with the anomaly, ~2.0 without. This ratio is more or less constant whether it's 1:1 or 2:2 or 4:4 or 8:8 (with 1:1 it's slightly lower at ~1.28).
This may or may not be consistent with one CQO uop being executed once a cycle as expected, and one SHLX uop taking a a spot (stalling for one cycle?) for 2 cycles, resulting in a runtime of (x/2 * 1 + x/2 * 2)/2 ~ x/1.33 cycles.
Well very interesting. I don't really read that as supporting my idea of a renamer limit: seems like the throughput would either be higher (assuming the CQO just goes "for free" in the same rename cycle) or lower and also depend heavy on the interleaving (assuming the SHLX uses a whole rename cycle by itself or something like that).
So maybe it's like you say: SHLX with the anomaly is only semi-pipelined: it takes 2 cycles before the next SHLX can dispatch. Perhaps it needs to use 1 cycle to handle the unfolding of the folded immediates, then the shift happens in the second cycle, and then the latency just has to be rounded up to 3 since all uops on those ports are 1 or 3, never 2 (which simplifies the scheduler, I believe).
That would also explain the original 1/cycle throughput for pure SHLX with anomaly: as 2 non-pipelined cycles on the port, 2 ports = the observed throughput.
So it's sort of like a 2 uop instruction but can't _actually_ be 2 uops because "rename" is too late to crack something into 2 uops (that's already happened), so it just does the doubling up internally?
Noise from the surrounding code aside, we see the same number of uops issued. However in the anomaly case, ~1/4th of the cycles are spent with no uops being executed, 1/2 are spent with only 1 uop being executed, and around 1/4 of the cycles have 2 uops being executed. I expected 0 and 2 being 50/50, consistently with there being one cycle stall, but if the uops are desynched and issued one cycle apart it would also explain the 1 being so prominent.
To confirm this I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works: