by duckingtest
4295 days ago
One argument? Sure, probably no difference. But once you start using longs as local variables and arguments, even if they all fit in registers within one function, they get pushed onto the stack whenever you call other functions inside it. It all adds up, and suddenly you're getting L1 cache misses. Anyway, unnecessary conversions mostly go away when you use link-time optimization (-fwhole-program or -flto in gcc).
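A minimal sketch of the situation described above, not taken from any real codebase: `scale` and `sum_scaled` are hypothetical names, and the `noinline` attribute (a GCC/Clang extension) is only there to force a real call. Because `a` and `b` are still needed after the calls, the compiler has to keep them in callee-saved registers or spill them to the stack.

```c
#include <stdint.h>

/* Hypothetical helper. noinline (GCC/Clang attribute) keeps the call
   from being optimized away, so the caller must preserve its live values. */
__attribute__((noinline))
int64_t scale(int64_t x) {
    return x * 3 + 1;
}

/* a and b are live across both calls to scale(), so they cannot stay in
   caller-saved registers; they end up in callee-saved registers or on the stack. */
int64_t sum_scaled(int64_t a, int64_t b) {
    int64_t s = scale(a);
    int64_t t = scale(b);
    return s + t + a + b;
}
```

If the two functions lived in separate files, something like `gcc -O2 -flto a.c b.c` would let gcc see across the file boundary and possibly inline the call; comparing the `-S` output with and without `-flto` is one way to check what actually happens on a given target.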
|
By contrast, when trying to optimize inner loops, I frequently encounter cases where the front-end limit of 4 micro-ops per cycle is the bottleneck, and getting rid of any extraneous instruction is a speedup. And rather than worrying about a deep stack causing L1 data misses, I'm more concerned with missing the L1 instruction cache, or with the extra micro-ops causing me to miss the ~1000-slot decoded micro-op cache.
These concerns are clearly at opposite ends of the performance spectrum, and which should dominate probably depends on the problem at hand.
(I glanced at your comment history. Welcome to HN! You have good insights. Please stick around.)