by BeeOnRope
3060 days ago
It doesn't decode to a single fused uop, but to two fused pairs of uops (so four unfused uops in total). So fusion happens twice: the load and the ALU op are fused, and the two store uops (store-address and store-data) are fused. If you use the three-instruction sequence instead, the load and the ALU op can't fuse, which costs one extra fused-domain uop and can make it slower (though not in this case, since the bottleneck is elsewhere).
http://www.agner.org/optimize/instruction_tables.pdf
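
To make the counting above concrete, here is a rough tally in Python. It is only a sketch of the arithmetic, not a model of any real decoder. It assumes the instruction under discussion is a memory-destination ALU op (something like `add [mem], reg`; the comment doesn't name it) and that the three-instruction alternative is a separate load, ALU op and store. The uop names and the `count_uops` helper are made up for illustration.

```python
# Each instruction is a list of fused-domain uops; each fused-domain uop is a
# tuple of the unfused uops it carries. All names here are illustrative.

# Assumed memory-destination form: one instruction, two fused pairs.
RMW_FORM = [
    [("load", "alu"), ("store_addr", "store_data")],
]

# Assumed three-instruction form: load and ALU op are separate instructions,
# so they cannot fuse; the store still fuses its two uops.
THREE_INSTRUCTION_FORM = [
    [("load",)],
    [("alu",)],
    [("store_addr", "store_data")],
]


def count_uops(instructions):
    """Return (fused_domain_uops, unfused_domain_uops) for a sequence."""
    fused = sum(len(insn) for insn in instructions)
    unfused = sum(len(uop) for insn in instructions for uop in insn)
    return fused, unfused


rmw_counts = count_uops(RMW_FORM)
three_counts = count_uops(THREE_INSTRUCTION_FORM)

print("memory-destination form (fused, unfused):", rmw_counts)
print("three-instruction form  (fused, unfused):", three_counts)
```

Under those assumptions both forms do the same four unfused uops of work; the only difference is one extra fused-domain uop in the three-instruction form, which matters only if the front end or issue width is the bottleneck.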