| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jeffbee 1281 days ago

That's the whole gist of gchatelet's implementation. Make it easy for the compiler to inline memcpy, so the compiler gets a chance to propagate the information it has about the parameters. In many cases it can eliminate all the branches but one.

The GNU way of using a huge, branchy assembly block that is selected at runtime using ifunc means that the compiler never even got a chance.

Regarding the question of whether or not they are faster, see section 4.4 of this paper. Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system.

https://storage.googleapis.com/pub-tools-public-publication-...

1 comments

moonchild 1281 days ago

> Make it easy for the compiler to inline memcpy, so the compiler gets a chance to propagate the information it has about the parameters

I have low hopes for compilers. Inlining heuristics are terribly complicated, and optimisations that result therefrom will only be things that the compiler can prove for _all_ invocations. Inlining won't get you 'this call site is usually large', or 'this call site is small, but predictable', or 'this call site is really unpredictable, so use weird branchfree tricks'. (A JIT compiler might do better, but people do not usually JIT c or c++.)

> Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system

It's not all or nothing, and rep cmps is utter trash. My memcmp, for instance, I was careful to keep below 4 cache lines (or 3, considering only the version that doesn't overread)—<https://github.com/moon-chilled/fancy-memcmp/blob/master/mem...>—and it actually has reasonable throughput.

(Also: inlining is not the way to go if you are frontend-bound...)

menaerus 1281 days ago

> (Also: inlining is not the way to go if you are frontend-bound...)

When workload is frontend-bound, then the culprit is usually either in instruction cache misses (e.g. a lot of unpredictable virtual calls or generally poor code layout) or branch mispredictions (e.g. a lot of unpredictable branches in your code). I fail to see how inlining the code can be correlated to these two effects other than, what I believe is a myth, that inlining the code will by (1) growing the executable size (2) put more pressure on the instruction cache size and (3) therefore end up with degraded performance. In my experience, rarely I have seen even nr. (1) taking place (compilers and linkers are way too smart nowadays) and I think I have never managed to measure the effects of (2) and (3) in a non micro-benchmark scenario.

Anyway, if I were to optimize the workload found to be frontend-bound, eliminating branches and getting rid of the dynamic dispatch would be the first two things to check. Inlining maybe but only maybe in the third place.