by userbinator 4089 days ago
It's a lot smaller (1 byte vs. 6), which means it takes up less space in the instruction cache and less decode bandwidth, so there are fewer cache misses. x86 CPUs have also had a dedicated "stack engine" since the Pentium M (though, not surprisingly, NetBurst lacks one), which contains an adder and a copy of the stack pointer to handle push/pop operations. This is faster than using the general-purpose ALUs and memory read/write ports, and it also frees those up for use by other non-stack instructions. On the other hand, it means that reading or writing the stack pointer explicitly between implicit stack operations incurs a little extra latency to synchronise the stack engine's value with the "real" ESP register.
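
To make the size difference and the synchronisation cost a bit more concrete, here is a rough Python sketch. It assumes 32-bit mode and that the 6-byte alternative is the `sub esp, 4` / `mov [esp], eax` pair, with EAX picked arbitrarily. `ToyStackEngine` is a hypothetical toy model of the bookkeeping only (a pending offset that gets folded into ESP when ESP is used explicitly); it is not a description of how the real hardware is built.

```python
# 32-bit x86 encodings. EAX is an arbitrary register choice (assumption).
PUSH_EAX = bytes([0x50])                  # push eax
SUB_ESP_4 = bytes([0x83, 0xEC, 0x04])     # sub esp, 4
MOV_ESP_EAX = bytes([0x89, 0x04, 0x24])   # mov [esp], eax
SUB_MOV = SUB_ESP_4 + MOV_ESP_EAX         # the explicit two-instruction form

print(len(PUSH_EAX), "byte vs", len(SUB_MOV), "bytes")


class ToyStackEngine:
    """Hypothetical toy model: push/pop only adjust a pending offset;
    an explicit read of ESP has to fold that offset in first."""

    def __init__(self, esp):
        self.esp = esp        # value held in the "real" ESP register
        self.delta = 0        # offset accumulated by the stack engine
        self.sync_uops = 0    # how many times a synchronisation was needed

    def push(self):
        self.delta -= 4       # handled by the engine's own adder

    def pop(self):
        self.delta += 4

    def read_esp(self):
        # Explicit use of ESP: synchronise only if an offset is pending.
        if self.delta != 0:
            self.esp += self.delta
            self.delta = 0
            self.sync_uops += 1
        return self.esp


engine = ToyStackEngine(esp=0x1000)
engine.push()
engine.push()
print(hex(engine.read_esp()), engine.sync_uops)
```

In this model a run of pushes and pops costs nothing extra until something reads ESP directly, which is the "little extra latency" case described above.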

Memory reads and writes do take a few more cycles to complete, but since a push is a write, the CPU can carry on with the independent instructions that follow it without waiting. All of the above assumes a CPU based on P6 or one of its successors (Core, Nehalem, Sandy Bridge, Ivy Bridge, Haswell, etc.); NetBurst and Atom are very different.

Linus also has some interesting things to say about using the dedicated stack instructions: http://yarchive.net/comp/linux/pop_instruction_speed.html

Somewhat amusingly, GCC was well known to generate the explicit sub/mov instructions by default, while most other x86 C compilers I knew of, including MSVC and ICC, would always use push.

1 comment

Thanks to you and awhitworth! Very interesting stuff, kept me reading. (And soon searching to understand some of the ideas. And thinking of Linus's puzzle about `call` being faster than a `push` before a jump. Seems to be one of those cases where higher-level abstractions can be optimized better than lower-level ones. I suppose because lower-level ones are too general-purpose, while higher-level ones are constrained.)