Hacker News
by matja 54 days ago
SUB has higher latency than XOR on some Intel CPUs:

Latency (L) and throughput (T) measurements from the InstLatx64 project (https://github.com/InstLatx64/InstLatx64):

  | GenuineIntel | ArrowLake_08_LC | SUB r64, r64 | L: 0.26ns=  1.00c  | T:   0.03ns=   0.135c |
  | GenuineIntel | ArrowLake_08_LC | XOR r64, r64 | L: 0.03ns=  0.13c  | T:   0.03ns=   0.133c |
  | GenuineIntel | GoldmontPlus    | SUB r64, r64 | L: 0.67ns=  1.0 c  | T:   0.22ns=   0.33 c |
  | GenuineIntel | GoldmontPlus    | XOR r64, r64 | L: 0.22ns=  0.3 c  | T:   0.22ns=   0.33 c |
  | GenuineIntel | Denverton       | SUB r64, r64 | L: 0.50ns=  1.0 c  | T:   0.17ns=   0.33 c |
  | GenuineIntel | Denverton       | XOR r64, r64 | L: 0.17ns=  0.3 c  | T:   0.17ns=   0.33 c |
I couldn't find any AMD chips where the same is true.
2 comments

0.03 ns per instruction would correspond to a clock of about 33 GHz, and the chip doesn't actually clock that fast. What I think you're seeing is the front end detecting the zeroing idiom and directing the renamer to zero that register, removing the instruction from the stream before it reaches the execution resources.
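
To make that arithmetic concrete, here is a rough sketch in Python using the Arrow Lake rows from the table above. The helper name `implied_clock_ghz` is made up for illustration. The inputs are the rounded figures as posted, so the results are only approximate.

```python
def implied_clock_ghz(time_ns: float, cycles: float) -> float:
    """Clock frequency (GHz) implied by 'cycles' taking 'time_ns' nanoseconds."""
    return cycles / time_ns

# SUB row: 0.26 ns is reported as 1.00 cycle, which implies a plausible clock.
sub_clock = implied_clock_ghz(0.26, 1.00)

# If the 0.03 ns XOR figure were one full cycle, the clock would have to be
# implausibly high. That is the hint that the XOR never ran in an ALU.
naive_xor_clock = implied_clock_ghz(0.03, 1.00)

print(f"clock implied by SUB row:          {sub_clock:.2f} GHz")
print(f"clock if XOR took one full cycle:  {naive_xor_clock:.1f} GHz")
```

The first value comes out a bit under 4 GHz and the second a bit over 33 GHz, which is why the 0.03 ns figure can't be a real one-cycle latency.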
SUB does not have higher latency than XOR on any Intel CPU, when those operations are really performed, e.g. when their operands are distinct registers.

The odd values in your list, i.e. those where the latency is less than 1 clock cycle, come from cases where the operations were not actually executed.

There are various special cases that are detected so that the operation is never executed in an ALU. For instance, when both operands of XOR/SUB are the same register, the operation is not performed and a zero result is produced directly. On certain CPUs, the cases where one operand is a small constant are also detected and the operation is done by special circuits at the register renamer stage, so such operations do not reach the schedulers for the execution units.

To understand the meaning of the values, we must see the actual loop that has been used for measuring the latency.

In reality, the latency measured between truly dependent instructions cannot be less than 1 clock cycle. If a latency-measuring loop yields a time that, when divided by the number of instructions, is less than 1 cycle, that is because some of those instructions were skipped. So the XOR latency-measuring loop must have included XORs between identical operands, which were bypassed.
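
A toy model of that argument, as a sketch only: this is not the actual InstLatx64 loop, which I haven't seen. The function name and the `rename_width` parameter are hypothetical. The width of 3 is just an assumption picked to match the 0.33c throughput figure in the Goldmont Plus rows above.

```python
def measured_cycles_per_instruction(n: int, latency: float,
                                    eliminated: bool,
                                    rename_width: float) -> float:
    """Per-instruction time a naive latency loop would report (toy model).

    n            -- number of back-to-back instructions in the loop
    latency      -- cycles each executed instruction adds to the dependency chain
    eliminated   -- True if the renamer handles the instruction itself
                    (e.g. XOR r, r), so it adds nothing to the chain
    rename_width -- instructions the front end handles per cycle (assumed)
    """
    chain_cycles = 0.0 if eliminated else n * latency
    frontend_cycles = n / rename_width
    # The loop takes as long as its slowest constraint.
    return max(chain_cycles, frontend_cycles) / n

# A chain of really executed, truly dependent ops: bounded by latency.
executed = measured_cycles_per_instruction(1000, 1.0, False, 3)

# The same chain with every op eliminated at rename: bounded by front-end width.
skipped = measured_cycles_per_instruction(1000, 1.0, True, 3)

print(executed, skipped)
```

In this model the "latency" of the eliminated instruction collapses to its throughput, which is the same pattern as the XOR rows in the table, where L and T are nearly identical.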