| > may be a handful of microoperations merged in with a stream of other instructions. It may indeed in theory, but most of the time in real-world it won't. And BTW - let's not forget about the implicit costs of being unable to inline (which can be huuuuuge). Speaking of the real world - we've just had our article accepted, presenting what is supposedly the fastest universal hashing function as of today (in spite of its math being heavier than that of the existing ones) - and the numbers in the OP are consistent with our real-world experience while we were optimising it (well, within an order of magnitude at least). > but there is a tone of precision and certainty to them that is a bit deceptive. OP: "Last but not least, a word of caution: all the estimates here are just indications of the order of magnitude". > Talking about how much an operation costs is pointless; build things one way, measure that, then make the smallest change possible and measure that. Sure. The problem is that almost nobody has time to do it in real-world projects. Which leads to even cruder estimates (such as "virtual function call costs are negligible, regardless of the number of times they're called") being used - causing lots of crazily inefficient programs. IMO, the OP is a reasonable middle ground between all-out quasi-stationary testing and even worse guesstimates ;-). |
Congratulations on getting a paper accepted (I would be interested in reading it, as I love hashing work), but your claim that "most of the time in real-world it won't" is nonsense. The typical x86 function call overhead is a correctly predicted call, some register saves (using an optimized stack engine), the real work, and some register restores (using the same optimized stack engine). This is not typically 15-30 cycles' worth of overhead. The loads and stores generally go to and from L1 cache and are generally predictable (if the function itself is predictable), and none of the operations that are part of a conventional function call are all that expensive. In 15-30 cycles you can do 30-60 loads and/or 15-30 stores (mileage varies by uarch) and 60-120 integer operations if you are very, very lucky. Compare that budget with the handful of instructions in the typical argument setup, function prolog/epilog, etc.
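For anyone who would rather check this than take either number on faith, here is a minimal sketch of the kind of A/B measurement I mean. It assumes GCC or Clang (for `__attribute__((noinline))`), and every name in it (`step_inline`, `step_call`, `run_inline`, `run_call`, `time_ns`) is made up for illustration. The toy workload is a single multiply-add, so it says nothing about real code; the point is only that the two variants differ by exactly one thing, the call.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical toy workload: one multiply-add per element. */
static inline uint64_t step_inline(uint64_t acc, uint64_t x) {
    return acc * 31u + x;
}

/* Same body, but the compiler is asked not to inline it
 * (GCC/Clang attribute; other compilers spell this differently). */
__attribute__((noinline)) uint64_t step_call(uint64_t acc, uint64_t x) {
    return acc * 31u + x;
}

/* Variant A: the step is visible to the optimizer inside the loop. */
uint64_t run_inline(uint64_t n) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; i++) acc = step_inline(acc, i);
    return acc;
}

/* Variant B: identical loop, but each step is a real call. */
uint64_t run_call(uint64_t n) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; i++) acc = step_call(acc, i);
    return acc;
}

/* CPU time in nanoseconds for one run, using standard clock().
 * The result is written to *out so the work cannot be discarded. */
double time_ns(uint64_t (*fn)(uint64_t), uint64_t n, uint64_t *out) {
    clock_t t0 = clock();
    *out = fn(n);
    clock_t t1 = clock();
    return (double)(t1 - t0) * 1e9 / (double)CLOCKS_PER_SEC;
}

/* Per-iteration difference between the two variants, in nanoseconds.
 * Asserts that both variants computed the same value. */
double call_delta_ns(uint64_t n) {
    uint64_t a = 0, b = 0;
    double ta = time_ns(run_inline, n, &a);
    double tb = time_ns(run_call, n, &b);
    assert(a == b);
    return (tb - ta) / (double)n;
}
```

Call `call_delta_ns` from your own `main` with a large `n` (say a few hundred million), repeat it several times, and convert to cycles using your clock frequency. Expect the answer to move with compiler, flags and uarch, which is rather the point.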
As you hint at, function call overhead generally comes from interrupting live ranges (forcing save/restore or simply causing the register allocator to pick a worse strategy) and losing the opportunity to optimize across function boundaries. This cost can be enormous and nebulous, and it isn't even a constant: it is imposed repeatedly, and differently, at each point in the program. I have code where the cost of sticking a function call in and losing some constant propagation information is millions of cycles per call, not 15-30. In other places the cost of a call is effectively zero.
In still other places the cost of a function call is 'negative' - that is, it's cheaper to have a function exist as an independent unit than to inline it. This is typically an i-cache issue but we've seen a host of weird effects here.
So - under some (unusual) circumstances, function call overhead can be practically free (i.e. the out-of-order engine has a lot of opportunities to slot in the prolog/epilog instructions, there are no branch mispredicts, etc). Typically it will be inherently cheaper than 15-30 cycles - but lost opportunities for optimization may take you to numbers that are insanely higher than that.
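To make the "lost optimization" case concrete, here is a small sketch (again assuming GCC or Clang, with hypothetical names `ipow_inline`, `ipow_call`, `cube_sum_inline`, `cube_sum_call`). The exponent is a literal at the call site. When the helper is inlined, the compiler is free to propagate that constant and turn the loop into a couple of multiplies; behind an opaque call it generally has to run the generic loop every time. Note that `noinline` only blocks inlining: some compilers may still specialize the function by other means (GCC has a separate `noclone` attribute for that), so check the generated assembly before trusting any timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic integer power by repeated multiplication; inlinable. */
static inline uint64_t ipow_inline(uint64_t base, unsigned exp) {
    uint64_t r = 1;
    for (unsigned i = 0; i < exp; i++) r *= base;
    return r;
}

/* Same body, but kept out of line (GCC/Clang attribute). */
__attribute__((noinline)) uint64_t ipow_call(uint64_t base, unsigned exp) {
    uint64_t r = 1;
    for (unsigned i = 0; i < exp; i++) r *= base;
    return r;
}

/* exp == 3 is visible here, so after inlining the inner loop
 * can be folded down to base * base * base. */
uint64_t cube_sum_inline(const uint64_t *v, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += ipow_inline(v[i], 3);
    return sum;
}

/* Same computation, but the constant 3 has to cross a call boundary. */
uint64_t cube_sum_call(const uint64_t *v, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += ipow_call(v[i], 3);
    return sum;
}
```

Both functions return the same value; what differs is how much the optimizer is allowed to know. Scale the helper up from a three-iteration loop to something with real structure and the gap can grow far beyond any fixed per-call figure.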
The reason I arced up over this is that it's just not useful to say "within an order of magnitude". If you say 15-30 is the magic number you are creating folklore. We are burdened by folkloric stuff ("the interpreter penalty that results because branch predictors 'can't predict indirect branches'") that results in considerably worse designs. It's better to know you don't know than to promulgate simplistic rules of thumb that misstate the real issues.
This is particularly true because a lot of this 'folkloric optimization' stuff people do generally leads away from a simple and direct expression of their ideas in their favorite coding style, and towards a heavily "optimized" form that's super-obscure because some "performance high priest" has declared it Performant. We've had a number of cheap laughs in our day re-rolling loops, replacing inline asm with C code, and restoring sanity and getting a performance win from doing it. I've been just as guilty of playing Performance High Priest as anyone else, mind you.