

by userbinator 4089 days ago

It's a lot smaller (1 byte vs 6), which means less space spent in the cache and decoder, reducing cache misses and decode bandwidth. The x86 has also had a dedicated "stack engine" since the Pentium M (though, not surprisingly, it is absent in NetBurst), which contains an adder and a copy of the stack pointer to handle push/pop operations. This is faster than using the general-purpose ALUs and memory read/write ports, and also frees those up for use by other non-stack instructions. On the other hand, it means that reading/writing the stack pointer explicitly between implicit stack operations incurs a little extra latency to synchronise the stack engine's copy with the "real" ESP register.
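To make the size comparison concrete, here is a small sketch that counts the bytes, assuming the two sequences being compared are `push eax` versus `sub esp, 4` followed by `mov [esp], eax` in 32-bit mode. The choice of EAX as the source register is my assumption for illustration; the byte values are the standard encodings for those instructions.

```python
# Machine-code bytes for 32-bit x86, assuming EAX is the value being pushed.
PUSH_EAX = bytes([0x50])                 # push eax
SUB_ESP_4 = bytes([0x83, 0xEC, 0x04])    # sub esp, 4
MOV_ESP_EAX = bytes([0x89, 0x04, 0x24])  # mov [esp], eax


def code_size(*instructions):
    """Total encoded length in bytes of a sequence of instructions."""
    return sum(len(insn) for insn in instructions)


push_size = code_size(PUSH_EAX)
explicit_size = code_size(SUB_ESP_4, MOV_ESP_EAX)
print(push_size, explicit_size)  # prints "1 6"
```

Other registers or larger immediates can change the explicit sequence's length, so 6 bytes is specific to this pairing.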

Memory reads/writes do take a few more cycles to complete, but since this is a write, the CPU can continue on with other non-dependent instructions following it. All the above information assumes a CPU based on P6 and its successors (Core, Nehalem, Sandy Bridge, Ivy Bridge, Haswell, etc.); NetBurst and Atom are very different.
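For anyone trying to picture the stack engine's synchronisation cost, here is a deliberately simplified toy model, not a description of the real hardware. The class name `ToyStackEngine`, its fields, and the idea of counting "sync uops" are all hypothetical names chosen for illustration; the only behaviour it tries to capture is that push/pop adjust a private offset, and an explicit read of ESP has to fold that offset back in first.

```python
class ToyStackEngine:
    """Toy model of a front-end stack engine (hypothetical, simplified)."""

    def __init__(self, esp):
        self.esp = esp      # the "real" ESP register value
        self.delta = 0      # pending offset held by the stack engine's adder
        self.sync_uops = 0  # extra operations needed to reconcile the two

    def push(self):
        # Handled by the dedicated adder: no general-purpose ALU work.
        self.delta -= 4

    def pop(self):
        self.delta += 4

    def read_esp(self):
        # Explicit use of ESP (e.g. mov eax, esp): if the stack engine is
        # holding a pending offset, it must be applied first, which costs
        # one extra synchronisation step in this model.
        if self.delta != 0:
            self.esp += self.delta
            self.delta = 0
            self.sync_uops += 1
        return self.esp


engine = ToyStackEngine(0x1000)
engine.push()
engine.push()
engine.pop()
first = engine.read_esp()   # pending offset of -4 is folded in here
second = engine.read_esp()  # already in sync, no extra step
print(hex(first), hex(second), engine.sync_uops)  # prints "0xffc 0xffc 1"
```

In this model a run of pushes and pops costs nothing extra, and only the first explicit ESP access after them pays for the synchronisation.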

Linus also has some interesting things to say about using the dedicated stack instructions: http://yarchive.net/comp/linux/pop_instruction_speed.html

Somewhat amusingly, GCC was well known to generate the explicit sub/mov instructions by default, while most other x86 C compilers I knew of, including MSVC and ICC, would always use push.


calibraxis 4089 days ago

Thanks to you and awhitworth! Very interesting stuff, kept me reading. (And soon searching to understand some of the ideas. And thinking of Linus's puzzle about `call` being faster than a `push` before a jump. Seems to be one of those cases where higher-level abstractions can be optimized better than lower-level ones. I suppose because lower-level ones are too general-purpose, while higher-level ones are constrained.)
