
Trying some perf events confirms that there is no extra inserted uop. Going back to the SHLX R[i], R[i], RCX loop, we have:

No anomaly:

     2,190,954,207      cpu_core/cycles:u/                                                      ( +-  0.14% )
     4,412,790,656      cpu_core/uops_issued.any:u/                                             ( +-  0.11% )
        39,386,389      cpu_core/exe_activity.1_ports_util:u/                                        ( +- 11.57% )
     2,121,401,346      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.11% )
         6,015,432      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  8.87% )
       593,599,670      cpu_core/uops_retired.stalls:u/                                         ( +-  0.85% )

Anomaly:

     4,357,567,336      cpu_core/cycles:u/                                                      ( +-  0.15% )
     4,448,899,140      cpu_core/uops_issued.any:u/                                             ( +-  0.26% )
     2,107,051,688      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  0.14% )
     1,106,699,503      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.13% )
     1,129,497,409      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.42% )
     2,502,226,997      cpu_core/uops_retired.stalls:u/                                         ( +-  0.38% )

Noise from the surrounding code aside, we see the same number of uops issued in both cases. However, in the anomaly case roughly 1/4 of the cycles are spent with no uops executing, about 1/2 with only 1 uop executing, and around 1/4 with 2 uops executing. I expected a 50/50 split between 0 and 2, consistent with a one-cycle stall, but if the uops are desynchronized and issued one cycle apart, that would also explain why the 1-uop case is so prominent.
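The fractions quoted above can be reproduced from the raw counters. The snippet below is only a sketch: it takes the counters at face value (as the text does), treating `exe_bound_0_ports`, `1_ports_util` and `2_ports_util` as "cycles with 0, 1 and 2 uops executing", and the helper name `port_shares` is made up for illustration.

```python
# Counter values copied from the "Anomaly" perf output above.
cycles = 4_357_567_336
ports = {
    0: 1_129_497_409,  # cpu_core/exe_activity.exe_bound_0_ports
    1: 2_107_051_688,  # cpu_core/exe_activity.1_ports_util
    2: 1_106_699_503,  # cpu_core/exe_activity.2_ports_util
}


def port_shares(total_cycles, port_counts):
    """Return the fraction of all cycles spent at each port-utilization level.

    Hypothetical helper: it simply divides each counter by the cycle count.
    """
    return {n: count / total_cycles for n, count in port_counts.items()}


shares = port_shares(cycles, ports)
for n, share in sorted(shares.items()):
    print(f"{n} uops executing: {share:.1%} of cycles")
```

The three shares come out close to 1/4, 1/2 and 1/4, and they do not sum to exactly 1 because of the surrounding code and counter noise.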

To confirm this I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works:

     4,581,269,346      cpu_core/cycles:u/                                                      ( +-  0.10% )
     4,556,347,404      cpu_core/uops_issued.any:u/                                             ( +-  0.12% )
       133,363,872      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  7.73% )
     2,082,838,530      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.24% )
     2,165,817,614      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.06% )
     3,090,362,239      cpu_core/uops_retired.stalls:u/                                         ( +-  0.16% )

Now the cycles are split almost entirely between 0 and 2 uops executed, as theorized.
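As a rough cross-check, the ratio of issued uops to cycles can be compared across the three runs. This is a sketch built only from the counters listed above; the run labels and the helper name `uops_per_cycle` are illustrative, and the LFENCE run includes the cost of the fence itself, so its ratio is only approximately comparable.

```python
# (cycles, uops_issued.any) pairs copied from the three perf outputs above.
runs = {
    "no anomaly": (2_190_954_207, 4_412_790_656),
    "anomaly": (4_357_567_336, 4_448_899_140),
    "LFENCE": (4_581_269_346, 4_556_347_404),
}


def uops_per_cycle(cycles, uops_issued):
    """Average number of uops issued per cycle over the whole run."""
    return uops_issued / cycles


ratios = {name: uops_per_cycle(c, u) for name, (c, u) in runs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} uops issued per cycle")
```

The normal run issues about 2 uops per cycle, while both the anomaly run and the LFENCE run sit near 1 uop per cycle. The overall throughput is the same in the last two cases even though the per-cycle port utilization is distributed differently.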