I think that's because, at least on some CPUs, xchg takes three macro-operations. See http://www.agner.org/optimize/microarchitecture.pdf (section 17.4, page 188):

"Vector path instructions are less efficient than single or double instructions because they require exclusive access to the decoders and pipelines and do not always reorder optimally. For example:

    ; Example 17.1. AMD instruction breakdown
    xchg  eax, ebx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op
    xchg  ecx, edx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op

This sequence takes 4 clock cycles to decode because the vector path instructions must decode alone."
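
To make the arithmetic in that quote concrete, here is a toy model of the decode rule it describes. It is only a sketch: the function name `decode_cycles` and the `width=3` default (up to three direct-path instructions decoded per clock) are my assumptions, not something taken from the manual, and it ignores double instructions, fetch alignment, and everything after the decoders.

```python
def decode_cycles(paths, width=3):
    """Count decode cycles for a list of "vector" / "direct" instructions.

    Toy rule (assumed): a vector path instruction takes a whole cycle to
    itself, while up to `width` consecutive direct path instructions can
    share one cycle.
    """
    cycles = 0
    slots_left = 0  # free direct-path slots in the current cycle
    for path in paths:
        if path == "vector":
            cycles += 1      # decodes alone in a fresh cycle
            slots_left = 0   # nothing else may join that cycle
        elif path == "direct":
            if slots_left == 0:
                cycles += 1  # start a new cycle for direct path instructions
                slots_left = width
            slots_left -= 1
        else:
            raise ValueError(f"unknown decode path: {path!r}")
    return cycles


# xchg, nop, xchg, nop from Example 17.1
example = ["vector", "direct", "vector", "direct"]
print(decode_cycles(example))  # 4
```

Under the same assumptions, four direct-path instructions in a row would decode in 2 cycles, which is why the interleaved xchg instructions look so expensive by comparison.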