| HN Mirror


by glangdale 3513 days ago

I have been working in this area for 10 years, made millions of dollars of revenue from a product that was largely dependent on getting this sort of thing right, sold a startup to Intel whose main business was in this area, and work for Intel. I'm not 100% sure you are well situated to lecture me about the 'real world' because you wrote an academic paper.

Congratulations on getting a paper accepted (I would be interested in reading it, as I love hashing work), but your claim that "most of the time in real-world it won't" is nonsense. The typical x86 function call overhead is a correctly predicted call, some register saves (using an optimized stack engine), the real work, and some register restores (using the same optimized stack engine). This is not typically 15-30 cycles worth of overhead. The loads and stores generally go to and from L1 cache, are generally predictable (if the function itself is predictable), and none of the operations that are part of a conventional function call are all that expensive. In 15-30 cycles you can do 30-60 loads and/or 15-30 stores (mileage varies by uarch) and 60-120 integer operations if you are very, very lucky. Compare this with the typical argument setup, function prolog/epilog, etc.

As you hint at, function call overhead generally comes from interrupting live ranges (forcing save/restore or simply causing the register allocator to pick a worse strategy) and losing the opportunity to optimize across function boundaries - this cost can be enormous, nebulous, and not even a constant (it can be imposed repeatedly throughout the program). I have code where the cost of sticking a function call in and losing some constant propagation information is millions of cycles per call, not 15-30. In other places the cost of a call is effectively zero.
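To make the lost-optimization point concrete, here is a minimal sketch (not code from either poster). The names `bucket_inline`, `bucket_opaque`, and the constant 1024 are hypothetical. When the helper is visible at the call site, the compiler can see that the divisor is a constant power of two and typically replaces the modulo with a mask. Behind a call boundary the helper only sees a runtime value and typically has to emit a real divide. The `noinline` attribute is a GCC/Clang extension and only blocks inlining. An optimizer may still specialise the function in other ways, so checking the generated assembly is the only way to be sure what happened.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: GCC or Clang. On other compilers NOINLINE expands to nothing
// and the two variants may compile to identical code.
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Visible to the caller: with n known to be 1024, `h % n` can become `h & 1023`.
static inline std::uint64_t bucket_inline(std::uint64_t h, std::uint64_t n) {
    return h % n;
}

// Kept out of line: the body only sees a runtime n, so a generic divide is likely.
NOINLINE std::uint64_t bucket_opaque(std::uint64_t h, std::uint64_t n) {
    return h % n;
}

std::uint64_t sum_buckets_inline(const std::uint64_t* hs, std::size_t count) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < count; ++i) {
        s += bucket_inline(hs[i], 1024);  // constant can propagate into the helper
    }
    return s;
}

std::uint64_t sum_buckets_opaque(const std::uint64_t* hs, std::size_t count) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < count; ++i) {
        s += bucket_opaque(hs[i], 1024);  // constant is passed as an ordinary argument
    }
    return s;
}

// Self-check: both variants must compute the same value; only their cost differs.
void check_same(const std::uint64_t* hs, std::size_t count) {
    assert(sum_buckets_inline(hs, count) == sum_buckets_opaque(hs, count));
}
```

Both loops return the same sum. Any difference shows up only in the instructions the compiler emits and in how long they take, which is why a single per-call cycle figure is hard to state.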

In still other places the cost of a function call is 'negative' - that is, it's cheaper to have a function exist as an independent unit than to inline it. This is typically an i-cache issue but we've seen a host of weird effects here.

So - under some (unusual) circumstances, function call overhead can be practically free (i.e. the ooo engine has a lot of opportunities to insert prolog/epilog instructions, no branch mispredicts, etc). Typically it will be inherently cheaper than 15-30 cycles - but lost opportunities for optimization may take you to numbers that are insanely higher than that.

The reason I arced up over this is that it's just not useful to say "within an order of magnitude". If you say 15-30 is the magic number you are creating folklore. We are burdened by folkloric stuff ("the interpreter penalty that results because branch predictors 'can't predict indirect branches'") that results in considerably worse designs. It's better to know you don't know than to promulgate simplistic rules-of-thumb that misstate the real issues.

This is particularly true because a lot of this 'folkloric optimization' stuff people do generally leads away from a simple and direct expression of their ideas in their favorite coding style, and towards a heavily "optimized" form that's super-obscure because some "performance high priest" has declared it Performant. We've had a number of cheap laughs in our day re-rolling loops, replacing inline asm with C code, and restoring sanity and getting a performance win from doing it. I've been just as guilty of playing Performance High Priest as anyone else, mind you.


no-bugs 3513 days ago

> made millions of dollars of revenue... I'm not 100% sure you are well situated to lecture me about the 'real world' because you wrote an academic paper.

So, we're going to discuss millions of dollars instead, sigh... By that logic, Trump should be one of the best programmers in the universe, I guess.

> Typically it will be inherently cheaper than 15-30 cycles - but lost opportunities for optimization may take you to numbers that are insanely higher than that.

And still, even from your own rant it follows that in the vast majority of cases the estimate of 15-30 cycles will be well within an "order of magnitude" (TBH, I haven't seen millions myself, but that would have to be a really strange corner case).

> If you say 15-30 is the magic number you are creating folklore... We are burdened by folkloric stuff... that results in considerably worse designs.

And not having any such "folklore" results in even worse designs :-( (actually - MUCH worse ones). Using a list instead of a vector can easily give you a 100x penalty for absolutely zero reason (actually, up to 780x was experimentally observed - and that's without swapping). Off-loading 100-cycle chunks to a different thread (with a thread context switch back after calculating these 100 cycles) will never work, at least on x64 (though I remember meeting some folks from Intel - I think they were representing an OpenMP team - who were seriously preaching otherwise, based on an utterly silly "how many cores we managed to utilise" metric, without realising that the whole thing became _slower_ after they parallelised it ;-( ). And so on and so forth.
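For readers who want to check the list-versus-vector claim on their own machine, here is a rough sketch of a traversal comparison (not the experiment behind the 780x figure, and not code from either poster). The names `sum_all`, `time_sum_ns`, and `compare_traversal` are made up for illustration. It assumes a single untuned timing run with `std::chrono::steady_clock`, which is noisy, and a freshly built `std::list` whose nodes may sit close together in memory, so the gap it shows can be much smaller than in a long-running program with a fragmented heap.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Sum every element of a container; works for both std::vector and std::list.
template <typename Container>
std::uint64_t sum_all(const Container& c) {
    return std::accumulate(c.begin(), c.end(), std::uint64_t{0});
}

// Time one traversal in nanoseconds and hand the sum back through `out`
// so the compiler cannot discard the work.
template <typename Container>
std::int64_t time_sum_ns(const Container& c, std::uint64_t& out) {
    const auto t0 = std::chrono::steady_clock::now();
    out = sum_all(c);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

struct Comparison {
    std::int64_t vector_ns;
    std::int64_t list_ns;
    std::uint64_t vector_sum;
    std::uint64_t list_sum;
};

// Build a vector and a list holding the same values 0..n-1, then time one pass over each.
Comparison compare_traversal(std::size_t n) {
    std::vector<std::uint32_t> v(n);
    std::iota(v.begin(), v.end(), 0u);
    std::list<std::uint32_t> l(v.begin(), v.end());

    Comparison r{};
    r.vector_ns = time_sum_ns(v, r.vector_sum);
    r.list_ns = time_sum_ns(l, r.list_sum);
    return r;
}
```

Calling `compare_traversal` with a size well beyond the last-level cache and printing `list_ns / vector_ns` gives a first impression. A serious measurement would repeat the run many times and control the allocation pattern.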

Sure, the numbers are very rough. But saying "hey, it is not precise, so let's not even try to estimate" is even worse.


glangdale 3513 days ago

I don't think Trump made his money doing low-level performance programming for the past 10 years, so I'm not sure your analogy is valid.

However, since you, whoever you are, have not only written a hash table, but discovered profundities like 'Sometimes list costs 780x as much as vector for "absolutely zero reason"', and 'don't try to offload 100 cycles of work to another thread' I'm going to defer to your expertise. I recommend you stick a bone through your beard, pronounce yourself a performance guru, and make bank. Have fun.


no-bugs 3512 days ago

> I recommend you stick a bone through your beard, pronounce yourself a performance guru, and make bank. Have fun.

:-) :-) I LOVE when my opponent has to resort to personal insults :-). Leaving aside any sarcastic remarks in this regard:

For Intel's sake I Really Hope that these "profundities" are indeed very well known to you - and believe it or not, they have been very well known to me too for at least 10 years. However, this is not the point; the point is that there are LOTS of developers out there who do NOT know them - and the OP is intended for them (and not for "performance gurus").

It is actually THIS simple. Eliminating 90% of inefficiency does not really require black magic or "performance gurus" who know exactly how the pipeline of specific CPU works. And this is exactly what I'm arguing for - to educate app-level developers and architects about this low-hanging fruit of 10x+ inefficiencies; I can assure you that it is very far from being universal knowledge in app-level development circles.
