Hacker News new | ask | show | jobs
by Someone 3 days ago
> so instead it's piecemeal implementations mostly in numeric packages like eigen and lapack.

Because that’s where the user-noticeable gains can be made. Using popcount in code you run once is going to shave off, maybe, 100 cycles. That isn’t worth the extra cycles of that approach.

Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.

Finally, where it really matters, it’s not only a matter of recompiling the same code. For optimal performance, you also want to change loop unrolling strategy, stride count, etc.

3 comments

Based on the now-deprecated Clear Linux it does seem that these optimizations add up [0] and so maybe we should be considering them more broadly?

[0] https://www.phoronix.com/review/clear-linux-48p-ubuntu/6

AFAIK, that wasn’t solely a matter of picking the optimal compilation flags. It also included profile-guided optimizations and kernel tweaks.
Doesn't that get into the domain of a distro like Gentoo though? (Or sort of Nix as well) - rebuild everything to precisely target your actual architecture.
Yes, and the maximally performant configurations are at odds with a usefully hardened security posture. This only gets more true as time goes on and additional mitigations pile up.
POPCNT is an interesting example. Runtime dispatch (with a conditional branch) would actually make sense for it because it's comparatively difficult to implement from scratch. PDEP and PEXT might be similar (but I don't think compilers pattern-match for it, unlike POPCNT). AArch64 uses localized run-time dispatch extensively because LL/SC atomics are so very bad on current cores, but there isn't anything comparable in the x86-64 space (POPCNT isn't that frequent).

For many other things, like using a YMM register to copy a 32-byte struct or a variable shift, run-time dispatch just not make sense. You will only see a benefit if you generate this code unconditionally. For FMA, you wouldn't even get bit-identical output, leading to testing concerns.

> Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.

the thread is about runtime detection tbf

I see I wasn’t clear enough. The tool I discussed generates multiple binaries and then packs all of them into a single binary. I was referring to the former.

https://github.com/ronnychevalier/cargo-multivers:

“After building the different versions, it computes a hash of each version and it filters out the duplicates (i.e., the compilations that gave the same binaries despite having different CPU features). Finally, it builds a runner that embeds one version compressed (the source) and the others as compressed binary patches to the source. For instance, when building for the target x86_64-pc-windows-msvc, by default 4 different versions will be built, filtered, compressed, and merged into a single portable binary.

When executed, the runner uncompresses and executes the version that matches the CPU features of the host.”

Hopefully (and likely) the patches will not be too large, but for 6 binary compiler flags, you’d still have 2⁶ binaries.

Yeah, but that's because of pragmatic choices to limit the scope of the tool. In the wider context of "I've long been surprised there isn't more multiversion stuff built right into every language compile" it's easy to imagine a compiler that can heuristically detect which functions would benefit from certain CPU features, and walk over the call graph to find locations for runtime feature detection that balance detection overhead with code duplication for the fallback functions. For example merging the feature detection of adjacent function calls, making sure feature detection is moved out of hot loops, etc.

Obviously this is much easier to imagine than to implement. And in some languages it might be made impossible by certain language features (function pointers might become tricky). But this is more or less what some people do by hand in Rust with the more manual is_x86_feature_detected macro, so there's no obvious reasons why compilers couldn't automate it in at least some languages.

Ok, then it will be an explosion of binary size, if you have several code blocks optimized for each architecture level - I'm not very familiar with the subject, but I imagine it would have to be relatively large chunks of code, otherwise the constant branching would eat up the speed advantage.
These are usually pretty tight loops or constructs based on specific features.

An unspecialised popcnt is half the dozen instructions, for specialised versions it’s 4 implementations ranging from half a dozen to two dozen bytes.