by Someone
3412 days ago
I think that's because (at least on some CPUs) it takes three macro-operations. http://www.agner.org/optimize/microarchitecture.pdf (section 17.4, page 188):
"Vector path instructions are less efficient than single or double instructions because they require exclusive access to the decoders and pipelines and do not always reorder optimally. For example:
; Example 17.1. AMD instruction breakdown
xchg eax, ebx ; Vector path, 3 ops
nop ; Direct path, 1 op
xchg ecx, edx ; Vector path, 3 ops
nop ; Direct path, 1 op
This sequence takes 4 clock cycles to decode because the vector path instructions must decode alone."
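
To make the "4 clock cycles" figure concrete, here is a rough toy model of that decode rule in Python. It is only a sketch: the function name `decode_cycles` and the assumption that up to 3 direct path instructions can share one decode cycle are mine, not taken from the quoted text, and it ignores everything else a real decoder does (instruction length, alignment, fetch limits).

```python
def decode_cycles(instructions, direct_width=3):
    """Count decode cycles under a toy model of the quoted rule.

    Assumptions (not from the quoted manual excerpt):
    - a vector path instruction occupies a decode cycle by itself;
    - up to `direct_width` consecutive direct path instructions
      share one decode cycle.
    """
    cycles = 0
    slots_used = 0  # direct path instructions in the currently open group
    for _name, path in instructions:
        if path == "vector":
            if slots_used:
                cycles += 1  # close the partially filled direct path group
                slots_used = 0
            cycles += 1  # the vector path instruction decodes alone
        else:
            slots_used += 1
            if slots_used == direct_width:
                cycles += 1  # group is full, close it
                slots_used = 0
    if slots_used:
        cycles += 1  # close the last open group
    return cycles


# The sequence from Example 17.1.
sequence = [
    ("xchg eax, ebx", "vector"),
    ("nop", "direct"),
    ("xchg ecx, edx", "vector"),
    ("nop", "direct"),
]

# Same length, but direct path instructions only.
all_direct = [("nop", "direct")] * 4

print(decode_cycles(sequence))
print(decode_cycles(all_direct))
```

Under these assumptions the xchg/nop sequence needs one cycle per instruction, because each nop is stranded between two instructions that must decode alone, while four direct path instructions in a row fit into two cycles.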