by jeffbee
1281 days ago
That's the whole gist of gchatelet's implementation: make it easy for the compiler to inline memcpy, so the compiler gets a chance to propagate the information it has about the parameters. In many cases it can eliminate all the branches but one. The GNU way of using a huge, branchy assembly block that is selected at runtime using ifunc means that the compiler never even gets a chance. Regarding the question of whether or not they are faster, see section 4.4 of this paper. Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system. https://storage.googleapis.com/pub-tools-public-publication-...
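To make the inlining argument concrete, here is a minimal sketch of a header-visible copy routine written in plain C. It is not gchatelet's code; `sketch_memcpy`, `struct header`, and `copy_header` are hypothetical names invented for illustration. The point is only that the size dispatch is ordinary C, so when a call site passes a compile-time constant, an optimizing compiler can typically fold away every branch that size cannot take.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical inlinable copy. The fixed-size memcpy calls below are the
 * usual portable way to express unaligned loads and stores; compilers
 * normally lower them to single move instructions. */
static inline void *sketch_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n == 0)
        return dst;
    if (n < 4) {
        /* 1..3 bytes: first, middle, and last byte (they may coincide). */
        d[0] = s[0];
        d[n / 2] = s[n / 2];
        d[n - 1] = s[n - 1];
        return dst;
    }
    if (n <= 8) {
        /* 4..8 bytes: two 4-byte moves that may overlap in the middle. */
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
        return dst;
    }
    /* More than 8 bytes: whole 8-byte blocks, then one 8-byte move that
     * ends exactly at n and may overlap the last block. */
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        memcpy(d + i, &w, 8);
    }
    uint64_t last;
    memcpy(&last, s + n - 8, 8);
    memcpy(d + n - 8, &last, 8);
    return dst;
}

/* Hypothetical call site: n is sizeof(struct header) == 8, a constant,
 * so after inlining only the "4..8 bytes" path can remain. */
struct header { uint32_t id; uint32_t len; };

static inline void copy_header(struct header *out, const struct header *in) {
    sketch_memcpy(out, in, sizeof *in);
}
```

A call with a runtime `n` keeps the whole dispatch, of course; the benefit only shows up where the compiler can actually see something about the size.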
|
I have low hopes for compilers. Inlining heuristics are terribly complicated, and optimisations that result therefrom will only be things that the compiler can prove for _all_ invocations. Inlining won't get you 'this call site is usually large', or 'this call site is small, but predictable', or 'this call site is really unpredictable, so use weird branchfree tricks'. (A JIT compiler might do better, but people do not usually JIT C or C++.)
> Replacing the glibc memcmp with something trivial resulted in up to 1% speedup in Google web search, when considering the whole program. It doesn't microbenchmark as well as glibc memcmp, but it is a better function that doesn't wreck the performance of the rest of the system
It's not all or nothing, and rep cmps is utter trash. My memcmp, for instance, I was careful to keep below 4 cache lines (or 3, considering only the version that doesn't overread)—<https://github.com/moon-chilled/fancy-memcmp/blob/master/mem...>—and it actually has reasonable throughput.
(Also: inlining is not the way to go if you are frontend-bound...)
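For readers wondering what a compact memcmp that does not overread might look like, here is a rough sketch. It is not the fancy-memcmp code linked above, just one common approach under my own assumptions: compare 8 bytes at a time as big-endian integers, and handle the tail with a single overlapping load that ends exactly at `n` rather than reading past the buffers. The names `sketch_memcmp`, `load_be64`, and `cmp_u64` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble 8 bytes into a big-endian integer, so that integer order
 * matches lexicographic byte order. Written portably; compilers commonly
 * turn this into one load plus a byte swap on little-endian targets. */
static inline uint64_t load_be64(const unsigned char *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Branch-free three-way compare: -1, 0, or 1. */
static inline int cmp_u64(uint64_t a, uint64_t b) {
    return (a > b) - (a < b);
}

int sketch_memcmp(const void *a_, const void *b_, size_t n) {
    const unsigned char *a = a_, *b = b_;

    if (n < 8) {
        /* Short inputs: plain byte loop, never touches byte n or beyond. */
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return (a[i] > b[i]) - (a[i] < b[i]);
        return 0;
    }
    /* Whole 8-byte blocks. */
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t x = load_be64(a + i), y = load_be64(b + i);
        if (x != y)
            return cmp_u64(x, y);
    }
    /* Tail: one load covering bytes [n-8, n). It re-reads bytes already
     * known to be equal, which cannot change the result, and it never
     * reads past the end of either buffer. */
    return cmp_u64(load_be64(a + n - 8), load_be64(b + n - 8));
}
```

This says nothing about throughput on its own; it only illustrates that avoiding overreads does not require a large function.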