Hacker News new | ask | show | jobs
by igrunert 283 days ago
A CMPXCHG16B instruction is going to be faster than a function call; and if the function is inlined there's still binary size cost.

The last processor without the CMPXCHG16B instruction was released in 2006 so far as I can tell. Windows 8.1 64-bit had a hard requirement on the CMPXCHG16B instruction, and that was released in 2013 (and is no longer supported as of 2023). At minimum Firefox should be building with -mcx16 for the Windows builds - it's a hard requirement for the underlying operating system anyway.

2 comments

Let me play devil's advocate: for some reason, functions such as strcpy in glibc have multiple runtime implementations and are selected by the dynamic linker at load time.
And there's a performance cost to that. If there was only one implementation of strcpy and it was the version that happens to be picked on my particular computer, and that implementation was in a header so that it could be inlined by my compiler, my programs would execute faster. The downside would be that my compiled program would only work on CPUs with the relevant instructions.

You could also have only one implementation of strcpy and use no exotic instructions. That would also be faster for small inputs, for the same reasons.

Having multiple implementations of strcpy selected at runtime optimizes for a combination of binary portability between different CPUs and for performance on long input, at the cost of performance for short inputs. Maybe this makes sense for strcpy, but it doesn't make sense for all functions.

> my programs would execute faster

You can't really state this with any degree of certainty when talking about whole-program optimization and function inlining. Even with LTO today you're talking 2-3% overall improvement in execution time, without getting into the tradeoffs.

Typically, making it possible for the compiler to decide whether or not to inline a function is going to make code faster compared to disallowing inlining. Especially for functions like strcpy which have a fairly small function body and therefore may be good inlining targets. You're right that there could be cases where the inliner gets it wrong. Or even cases where the inliner got it right but inlining ended up shifting around some other parts of the executable which happened to cause a slow-down. But inliners are good enough that, in aggregate, they will increase performance rather than hurt it.

> Even with LTO today you're talking 2-3% overall improvement in execution time

Is this comparing inlining vs no inlining or LTO vs no LTO?

In any case, I didn't mean to imply that the difference is large. We're literally talking about a couple clock cycles at most per call to strcpy.

What I was trying to point out is that you're essentially talking about LTO. Getting into the weeds, the compiler _can't_ optimize strcpy(*) in practice because its not going to be defined in a header-only library, it's going to be in a different translation unit that gets either dynamically or statically linked. The only way to optimize the function call is with LTO - and in practice, LTO only accounts for 2-3% of performance improvements.

And at runtime, there is no meaningful difference between strcpy being linked at runtime or ahead of time. libc symbols get loaded first by the loader and after relocation the instruction sequence is identical to the statically linked binary. There is a tiny difference in startup time but it's negligible.

Essentially the C compilation and linkage model makes it impossible for functions like strcpy to be optimized beyond the point of a function call. The compiler often has exceptions for hot stdlib functions (like memcpy, strcpy, and friends) where it will emit an optimized sequence for the target but this is the exception that proves the rule. In practice, the benefits of statically linking in dependencies (like you're talking about) does not have a meaningful performance benefit in my experience.

(*) strcpy is weird, like many libc functions its accessible via __builtin_strcpy in gcc which may (but probably won't) emit a different sequence of instructions than the call to libc. I say "probably" because there are semantics undefined by the C standard that the compiler cannot reason about but the linker must support, like preloads and injection. In these cases symbols cannot be inlined, because it would break the ability of someone to inject a replacement for the symbol at runtime.

> What I was trying to point out is that you're essentially talking about LTO. Getting into the weeds, the compiler _can't_ optimize strcpy(*) in practice because its not going to be defined in a header-only library, it's going to be in a different translation unit that gets either dynamically or statically linked.

Repeating the part of my post that you took issue with:

> If there was only one implementation of strcpy and it was the version that happens to be picked on my particular computer, and that implementation was in a header so that it could be inlined by my compiler, my programs would execute faster.

So no, I'm not talking about LTO. I'm talking about a hypothetical alternate reality where strcpy is in a glibc header so that the compiler can inline it.

There are reasons why strcpy can't be in a header, and the primary technical one is that glibc wants the linker to pick between many different implementations of strcpy based on processor capabilities. I'm discussing the loss of inlining as a cost of having many different implementations picked at dynamic link time.

Afaik runtime linkers can't convert a function call into a single non-call instruction.
Linux kernel has an interesting optimization using the ALTERNATIVE macro, where you can directly specify one of two instructions and it will be patched at runtime depending on cpu flags. No function calls needed (although you can have a function call as one of the instructions). It's a bit more messy in userspace where you have to respect platform page flags, etc. but it should be possible.
They could always just make the updater/installer install a version optimized for the CPU its going to be installed on.
It's not that uncommon to run one system on multiple CPUs. People swap out the CPU in their desktops, people move a drive from one laptop to another, people make bootable USB sticks, people set up a system in a chroot on a host machine and then flash a target machine with the resulting image.
Detect that on launch and use the updater to reinstall.
Congratulations, you now need to make sure your on-launch detector is compatible with the lowest common denominator, while at the same time being able to detect modern architectures. You also now carry 10 different instances of firefox.exe to support people eventually running on Itanium, people that will open support requests and expect you to fix their abandoned platform.

For what reason, exactly ?

You want 32b x86 support: pay for it. You want <obscure architecture> support: pay for it. If you're ok with it being a fork, then maintain it.