|
Trying some perf events confirms that no extra uop is inserted. Going back to the SHLX R[i], R[i], RCX loop, we have:

No anomaly:
2,190,954,207 cpu_core/cycles:u/ ( +- 0.14% )
4,412,790,656 cpu_core/uops_issued.any:u/ ( +- 0.11% )
39,386,389 cpu_core/exe_activity.1_ports_util:u/ ( +- 11.57% )
2,121,401,346 cpu_core/exe_activity.2_ports_util:u/ ( +- 0.11% )
6,015,432 cpu_core/exe_activity.exe_bound_0_ports:u/ ( +- 8.87% )
593,599,670 cpu_core/uops_retired.stalls:u/ ( +- 0.85% )
Anomaly:
4,357,567,336 cpu_core/cycles:u/ ( +- 0.15% )
4,448,899,140 cpu_core/uops_issued.any:u/ ( +- 0.26% )
2,107,051,688 cpu_core/exe_activity.1_ports_util:u/ ( +- 0.14% )
1,106,699,503 cpu_core/exe_activity.2_ports_util:u/ ( +- 0.13% )
1,129,497,409 cpu_core/exe_activity.exe_bound_0_ports:u/ ( +- 0.42% )
2,502,226,997 cpu_core/uops_retired.stalls:u/ ( +- 0.38% )
Noise from the surrounding code aside, we see the same number of uops issued in both cases. However, in the anomaly case roughly 1/4 of the cycles are spent with no uops being executed, about 1/2 with only 1 uop being executed, and around 1/4 with 2 uops being executed. I expected 0 and 2 to split 50/50, consistent with a one-cycle stall, but if the uops are desynchronized and issued one cycle apart, that would also explain why 1 is so prominent.

To confirm this, I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works:
4,581,269,346 cpu_core/cycles:u/ ( +- 0.10% )
4,556,347,404 cpu_core/uops_issued.any:u/ ( +- 0.12% )
133,363,872 cpu_core/exe_activity.1_ports_util:u/ ( +- 7.73% )
2,082,838,530 cpu_core/exe_activity.2_ports_util:u/ ( +- 0.24% )
2,165,817,614 cpu_core/exe_activity.exe_bound_0_ports:u/ ( +- 0.06% )
3,090,362,239 cpu_core/uops_retired.stalls:u/ ( +- 0.16% )
Now the cycles are split between 0 and 2 uops executed, as theorized. |
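As a rough cross-check of the fractions quoted above, the counters can be divided by the cycle count of each run. This is only a sketch: the numbers are copied from the perf output above, while the dictionary keys and the `port_fractions` helper are names made up for illustration. The three buckets are not guaranteed to sum to 100%, since cycles with more than 2 busy ports are not listed here and `exe_bound_0_ports` may not cover every idle cycle.

```python
# Counter values copied from the perf output above (user-mode counts).
# Keys are illustrative shorthand, not perf event names:
#   ports_0 = exe_activity.exe_bound_0_ports
#   ports_1 = exe_activity.1_ports_util
#   ports_2 = exe_activity.2_ports_util
RUNS = {
    "no anomaly": {
        "cycles": 2_190_954_207,
        "ports_0": 6_015_432,
        "ports_1": 39_386_389,
        "ports_2": 2_121_401_346,
    },
    "anomaly": {
        "cycles": 4_357_567_336,
        "ports_0": 1_129_497_409,
        "ports_1": 2_107_051_688,
        "ports_2": 1_106_699_503,
    },
    "anomaly + LFENCE": {
        "cycles": 4_581_269_346,
        "ports_0": 2_165_817_614,
        "ports_1": 133_363_872,
        "ports_2": 2_082_838_530,
    },
}


def port_fractions(run):
    """Return each bucket as a fraction of the run's total cycles."""
    cycles = run["cycles"]
    return {key: run[key] / cycles for key in ("ports_0", "ports_1", "ports_2")}


for name, run in RUNS.items():
    fractions = port_fractions(run)
    summary = ", ".join(f"{key}={value:.1%}" for key, value in fractions.items())
    print(f"{name}: {summary}")
```

With these inputs, the anomaly run comes out close to the 1/4, 1/2, 1/4 split described above, and the LFENCE run is dominated by the 0-port and 2-port buckets.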