Hacker News new | ask | show | jobs
by calibraxis 4089 days ago
Why does `push eax` perform "much faster" than the following?

  sub esp, 4
  mov DWORD PTR SS:[esp], eax
A brief skim of Intel's "Software Developer’s Manual" (particularly ch. 6 on stacks), didn't seem to find an answer.

While hitting the ALU just for `sub` might be an extra step, doesn't hitting RAM make that a drop in the bucket? (`sub` may account for less than 1%?) Or is there some caching going on, so RAM may be updated in the background?

(I'm not an assembly programmer; very ignorant of what's happening.)

4 comments

It's a lot smaller (1 byte vs 6), which means less space spent in the cache and decoder, reducing cache misses and decode bandwidth. The x86 also has a dedicated "stack engine" since the Pentium M (but not suprisingly, absent in NetBurst), which contains an adder and copy of the stack pointer to handle push/pop operations. This is faster than using the general-purpose ALUs and memory read/write ports, and also frees those up for use by other non-stack instructions. On the other hand, it means reading/writing the stack pointer explicitly between implicit stack operations incurs a little extra latency to get the values between the stack engine and "real" ESP register synchronised.

Memory reads/writes do take a few more cycles to complete, but since this is a write, the CPU can continue on with other non-dependent instructions following it. All the above information assumes a CPU based on P6 and its successors (Core, Nehalem, Sandy Bridge, Ivy Bridge, Haswell, etc.); NetBurst and Atom are very different.

Linus also has some interesting things to say about using the dedicated stack instructions: http://yarchive.net/comp/linux/pop_instruction_speed.html

Somewhat amusingly, GCC was well known to generate the explicit sub/mov instructions by default, while most other x86 C compilers I knew of, including MSVC and ICC, would always use push.

Thanks to you and awhitworth! Very interesting stuff, kept me reading. (And soon searching to understand some of the ideas. And thinking of Linus's puzzle about `call` being faster than a `push` before a jump. Seems to be one of those cases where higher-level abstractions can be optimized better than lower-level ones. I suppose because lower-level ones are too general-purpose, while higher-level ones are constrained.)
I don't know exactly why, but I have some theories. First, the two instructions you list there aren't parallelizable, because you have a write to ESP and then a read from it, so the first instruction needs to complete first (or, get far enough in the pipeline for the partial result to be usable). This is almost certainly going to cause some stalls.

Beyond that, the best explanations I've seen are that the push/pop instructions are "optimized" in some way. I don't know if that means they are more optimized than just making the inc/read pair on the ESP atomic so it doesn't stall or if there is more to it.

It is quite hard to envision a scenario where the current stack frame wasn't in any cache level.
Ah ok, it has something to do with caching. So the cost of `sub` can be noticeable. Thanks!
It's about dependency chains (and fewer bytes, easier decoding, and fewer µops). The OOO core can do more if it has more (but shorter) dependency chains to work with.

http://www.agner.org/optimize/microarchitecture.pdf

§7.7.