Hacker News new | ask | show | jobs
by Someone 3412 days ago
I think that's because (at least on some CPUs) it takes three macro-operations. http://www.agner.org/optimize/microarchitecture.pdf (section 17.4, page 188):

"Vector path instructions are less efficient than single or double instructions because they require exclusive access to the decoders and pipelines and do not always reorder optimally. For example:

    ; Example 17.1. AMD instruction breakdown
    xchg  eax, ebx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op
    xchg  ecx, edx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op
This sequence takes 4 clock cycles to decode because the vector path instructions must decode alone."
1 comments

This is indeed a good explanation - I admit I was not aware of this detail of the K8/K10 processors.