by darkmighty 282 days ago
> Maybe they could also drop support for older x86_64 CPU's, releasing more optimised builds

Question: Don't optimizers support multiple ISA versions, similar to web polyfills, and run the appropriate instructions at runtime? I suppose the runtime checks have some cost. At least I don't think I've ever run anything that errored out due to specific missing instructions.
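
For reference, a minimal C sketch of the kind of runtime dispatch being asked about, assuming GCC or Clang on x86-64. The function names (`sum_baseline`, `sum_avx2`, `sum_u32`) are made up for illustration; `__builtin_cpu_supports` and the `target` attribute are compiler extensions, not standard C. The CPU check runs once, and every later call pays for an indirect call through a function pointer, which is the cost hinted at above:

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline version: compiled for whatever ISA level the build targets. */
static uint64_t sum_baseline(const uint32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same loop, but the compiler is allowed to use AVX2 inside this function. */
__attribute__((target("avx2")))
static uint64_t sum_avx2(const uint32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}
#endif

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Pick an implementation based on what the running CPU reports. */
static sum_fn pick_sum(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_baseline;
}

/* Public entry point: resolves the implementation on first use.
   Not thread-safe as written; a real program would resolve at startup. */
uint64_t sum_u32(const uint32_t *v, size_t n)
{
    static sum_fn impl;
    if (!impl)
        impl = pick_sum();
    return impl(v, n);
}
```

The binary still runs on a CPU without AVX2 because the AVX2 code is only reached after the check succeeds.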


There was a recent story about F-Droid running ancient x86-64 build servers and having issues due to missing ISA extensions:

https://news.ycombinator.com/item?id=44884709

But generally it is rare to see anything higher than x86-64-v3 as a requirement, and that works with almost all CPUs sold in the past 10+ years (Atoms being a prominent exception).

A CMPXCHG16B instruction is going to be faster than a function call; and if the function is inlined, there's still a binary size cost.

The last processor without the CMPXCHG16B instruction was released in 2006 so far as I can tell. Windows 8.1 64-bit had a hard requirement on the CMPXCHG16B instruction, and that was released in 2013 (and is no longer supported as of 2023). At minimum Firefox should be building with -mcx16 for the Windows builds - it's a hard requirement for the underlying operating system anyway.

Let me play devil's advocate: for some reason, functions such as strcpy in glibc have multiple runtime implementations and are selected by the dynamic linker at load time.
And there's a performance cost to that. If there was only one implementation of strcpy and it was the version that happens to be picked on my particular computer, and that implementation was in a header so that it could be inlined by my compiler, my programs would execute faster. The downside would be that my compiled program would only work on CPUs with the relevant instructions.

You could also have only one implementation of strcpy and use no exotic instructions. That would also be faster for small inputs, for the same reasons.

Having multiple implementations of strcpy selected at runtime optimizes for a combination of binary portability between different CPUs and performance on long inputs, at the cost of performance on short inputs. Maybe this makes sense for strcpy, but it doesn't make sense for all functions.

> my programs would execute faster

You can't really state this with any degree of certainty when talking about whole-program optimization and function inlining. Even with LTO today you're talking 2-3% overall improvement in execution time, without getting into the tradeoffs.

Typically, making it possible for the compiler to decide whether or not to inline a function is going to make code faster compared to disallowing inlining. Especially for functions like strcpy which have a fairly small function body and therefore may be good inlining targets. You're right that there could be cases where the inliner gets it wrong. Or even cases where the inliner got it right but inlining ended up shifting around some other parts of the executable which happened to cause a slow-down. But inliners are good enough that, in aggregate, they will increase performance rather than hurt it.

> Even with LTO today you're talking 2-3% overall improvement in execution time

Is this comparing inlining vs no inlining or LTO vs no LTO?

In any case, I didn't mean to imply that the difference is large. We're literally talking about a couple clock cycles at most per call to strcpy.

What I was trying to point out is that you're essentially talking about LTO. Getting into the weeds, the compiler _can't_ optimize strcpy(*) in practice because it's not going to be defined in a header-only library; it's going to be in a different translation unit that gets either dynamically or statically linked. The only way to optimize the function call is with LTO, and in practice LTO only yields a 2-3% performance improvement.

And at runtime, there is no meaningful difference between strcpy being linked at runtime or ahead of time. libc symbols get loaded first by the loader and after relocation the instruction sequence is identical to the statically linked binary. There is a tiny difference in startup time but it's negligible.

Essentially, the C compilation and linkage model makes it impossible for functions like strcpy to be optimized beyond the point of a function call. The compiler often has exceptions for hot stdlib functions (like memcpy, strcpy, and friends) where it will emit an optimized sequence for the target, but this is the exception that proves the rule. In my experience, statically linking in dependencies (like you're talking about) does not have a meaningful performance benefit.

(*) strcpy is weird. Like many libc functions, it's accessible via __builtin_strcpy in gcc, which may (but probably won't) emit a different sequence of instructions than the call to libc. I say "probably" because there are semantics undefined by the C standard that the compiler cannot reason about but the linker must support, like preloads and injection. In these cases symbols cannot be inlined, because it would break the ability of someone to inject a replacement for the symbol at runtime.
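
A small C sketch to make the footnote concrete, assuming GCC or Clang with optimization enabled. The function names are hypothetical. The compiler treats strcpy as a builtin by default, so whether a call survives depends on what it knows about the arguments; comparing the assembly from `gcc -O2 -S` with and without `-fno-builtin` should show the difference, though the exact output varies by compiler version:

```c
#include <string.h>

/* Copies a compile-time constant; dst must have room for at least 6 bytes.
   With builtins enabled the compiler may emit plain stores here instead
   of a call into libc. */
void copy_hello(char *dst)
{
    strcpy(dst, "hello");
}

/* Copies a string whose length is unknown at compile time; this will
   typically stay a real call (through the PLT when dynamically linked),
   so an injected replacement strcpy would still be honored. */
void copy_any(char *dst, const char *src)
{
    strcpy(dst, src);
}
```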

Afaik runtime linkers can't convert a function call into a single non-call instruction.

The Linux kernel has an interesting optimization using the ALTERNATIVE macro, where you can directly specify one of two instructions and it will be patched at runtime depending on CPU flags. No function calls needed (although you can have a function call as one of the instructions). It's a bit messier in userspace, where you have to respect platform page flags etc., but it should be possible.

They could always just make the updater/installer install a version optimized for the CPU it's going to be installed on.

It's not that uncommon to run one system on multiple CPUs. People swap out the CPU in their desktops, people move a drive from one laptop to another, people make bootable USB sticks, people set up a system in a chroot on a host machine and then flash a target machine with the resulting image.

Detect that on launch and use the updater to reinstall.

Congratulations, you now need to make sure your on-launch detector is compatible with the lowest common denominator, while at the same time being able to detect modern architectures. You also now carry 10 different instances of firefox.exe to support people eventually running on Itanium, people that will open support requests and expect you to fix their abandoned platform.

For what reason, exactly?

You want 32b x86 support: pay for it. You want <obscure architecture> support: pay for it. If you're ok with it being a fork, then maintain it.

As far as I can tell, GCC supports compiling multiple versions of a function, but can't automatically decide which functions to do that for, or how many versions to build targeting different instruction set extensions. The programmer needs to explicitly annotate each function, meaning it's not practical to do this for anything other than obvious hot spots.
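
A sketch of what that per-function annotation looks like, assuming GCC (or a recent Clang) on x86-64 with glibc. `target_clones` is a GCC attribute; the `MULTIVERSION` macro and `dot_i32` are names made up for this example. The compiler emits one clone per listed target plus a resolver that picks among them when the program is loaded:

```c
#include <stddef.h>
#include <stdint.h>

/* target_clones relies on loader support (ifunc), so only enable it where
   that is expected to exist; elsewhere the function is built once. */
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION /* no multiversioning on this target */
#endif

/* One source-level function; the compiler builds an AVX2 clone and a
   baseline clone, and the resolver selects one based on the CPU. */
MULTIVERSION
int64_t dot_i32(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];
    return acc;
}
```

Every annotated function is duplicated once per listed target, which is part of why this tends to be reserved for known hot spots.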

You can do that to some limited degree, but not really.

There are more relevant modern examples, but one example that I really think illustrates the issue well is floating point instructions. The x87 instruction set is the first set of floating point instructions for x86 processors, introduced in 1980 with the 8087 coprocessor. In the late 90s/early 2000s, Intel released CPUs with the new SSE and SSE2 extensions, with a new approach to floating point (x87 was really designed for use with a separate floating point coprocessor, with a design that's unfortunate now that CPUs have native floating point support).

So modern compilers generate SSE instructions rather than the (now considered obsolete) x87 instructions when working with floating point. Trying to run a program compiled with a modern compiler on a CPU without SSE support will just crash with an illegal instruction exception.

There are two main ways we could imagine supporting x87-only CPUs while using SSE instructions on CPUs with SSE:

Every time the compiler wants to generate a floating point instruction (or sequence of floating point instructions), it could generate the x87 instruction(s), the SSE instruction(s), and a conditional branch to the right place based on SSE support. This would tank performance. Any performance saving you get from using an SSE instruction instead of an x87 instruction is probably going to be outweighed by the branch.

The other option is: you could generate one x87 version and one SSE version of every function which uses floats, and let the dynamic linker sort out function calls and pick the x87 version on old CPUs and the SSE version on new CPUs. This would more or less leave performance unaffected, but it would, in the worst case, almost double your code size (since you may end up with two versions of almost every function). And in fact, it's worse: the original SSE only supports 32-bit floats, while SSE2 supports 64-bit floats. So you want one version of every function which uses x87 for everything (for the really old CPUs), one version which uses x87 for 64-bit floats and SSE for 32-bit floats, and one version which uses SSE and SSE2 for all floats. Oh, and SSE3 added some useful instructions, so you want a fourth version of some functions where you can use instructions from SSE3, with a slower fallback on systems without SSE3. Suddenly you're generating 4 versions of most functions. And this is only from SSE, without considering other axes along which CPUs differ.

You have to actively make a choice here about what to support. It doesn't make sense to ship every possible permutation of every function; you'd end up with massive executables. You typically assume a baseline instruction set from some time in the past 20 years, so you're typically gonna let your compiler go wild with SSE/SSE2/SSE3/SSE4 instructions and let your program crash on the i486. For specific functions which get a particularly large speed-up from using something more exotic (say, AVX512), you can manually include one exotic version and one fallback version of that function.

But this causes the problem that most of your program is gonna get compiled against some baseline, and the more constrained that baseline is, the more CPUs you're gonna support, but the slower it's gonna run (though we're usually talking single-digit percent slower, not orders of magnitude slower).

I consider it unlikely, but perhaps there are some instructions that don't have a practical polyfill for x86?

The only thing that comes to mind is some form of atomic instructions that need to interact with other code in well-defined ways. I don't see how you could polyfill cmpxchg16b, for example.
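
To illustrate why, here is a rough C sketch of using the instruction directly, assuming GCC or Clang on x86-64 (it will not build elsewhere). The names `pair128`, `cpu_has_cx16`, and `cas128` are invented for this example. The hardware guarantees the 16-byte compare-and-swap is atomic with respect to every other agent touching that memory; a software fallback would need a lock that all of those agents agree to take, and code compiled to use the raw instruction would not know about it:

```c
#include <cpuid.h>
#include <stdbool.h>
#include <stdint.h>

/* A 16-byte value; CMPXCHG16B requires a 16-byte aligned memory operand. */
typedef struct {
    _Alignas(16) uint64_t lo;
    uint64_t hi;
} pair128;

/* CPUID leaf 1, ECX bit 13 reports CMPXCHG16B support. */
bool cpu_has_cx16(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx >> 13) & 1u;
}

/* Atomically: if *p equals *expected, store desired into *p and return
   true; otherwise copy the current *p into *expected and return false.
   Must only be called when cpu_has_cx16() is true. */
bool cas128(pair128 *p, pair128 *expected, pair128 desired)
{
    bool ok;
    __asm__ __volatile__(
        "lock cmpxchg16b %[mem]"
        : "=@ccz"(ok), [mem] "+m"(*p),
          "+a"(expected->lo), "+d"(expected->hi)
        : "b"(desired.lo), "c"(desired.hi)
        : "memory");
    return ok;
}
```

Building with -mcx16 lets the compiler emit the same instruction for its 16-byte atomic builtins without hand-written assembly.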