Hacker News new | ask | show | jobs
by yshui 261 days ago
That's a cool find. I wonder if LLVM also does the other way around operation, where it pattern matches handwritten CAS loops and transform them into native ARM64 instructions.
3 comments

That's a very good question. A proper compiler engineer would know, but I will do my best to find something and report back.

Edit: I could not find any pass with a pattern matching to replace CAS loops. The closest thing I could find is this pass: https://github.com/llvm/llvm-project/blob/06fb26c3a4ede66755... I reckon one could write a similar pass to recognize CAS idioms, but its usefulness would be probably rather limited and not worth the effort/risks.

The term of art for this technique is "idiom recognition" and it's proper ancient, like, APL compilers did have some idiom recognition 50+ years ago.

An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction, C provides neither intrinsics (like Rust) not a dedicated "popcount" feature, so you can't write that but it's obviously what you want here and yup an optimising C compiler will do that.

However, LLVM is dealing with an IR generated by other compiler folk so I think it probably has less use for idiom recognition. Clang would do the recognition and lower to the same LLVM IR as Rust does for its intrinsic population count core::intrinsics::ctpop so the LLVM backend doesn't need to spot this. I might be wrong, but I think that's how it works.

> An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction, C provides neither intrinsics (like Rust) not a dedicated "popcount" feature, so you can't write that but it's obviously what you want here and yup an optimising C compiler will do that.

C compilers definitely have intrinsics for this, for GCC for instance it is `__builtin_popcount`.

And apparently it has even standard language support for it since C23, it's `stdc_count_ones` [1] and in C++ you have `std::popcount` [2]

[1] https://en.cppreference.com/w/c/numeric/bit_manip.html [2] https://en.cppreference.com/w/cpp/numeric/popcount.html

The existence of platform specific hacks is not interesting. In reality what happens is that software which has at any point cared about being portable doesn't use them.

But yes stdc_count_ones is indeed the intrinsic you'd want here, and only a few years after I stopped writing C, so thanks for mentioning that.

std::popcount is C++ but it's also kinda miserable that it took until C++ 20 and yet they still only landed the unsigned integer types, even though C++ 20 also insists the signed integers have two's complement representation, so the signed integers do have these desirable properties in fact but you can't use that.

> In reality what happens is that software which has at any point cared about being portable doesn't use them.

I don't think this generalization is actually true. Fast portable software compiles conditionally based on the target platform, picking the fast platform-specific intrinsic, and falls back to a slow but guaranteed portable software implementation. This pattern is widespread in numerical linear algebra, media codecs, data compressors, encryption, graphics, etc.

Maybe we are just quibbling over semantics but the compiler intrinsic here is '__builtin_popcount'. 'stdc_count_ones' is a standard library element that presumably will be implemented using the intrinsic.

And FWIW all major C/C++ have for a long time have had a an intrinsic for this. In clang it even has the same name, Visual Studio it's something like just '_popcount'. So it has long been easy to roll your own macro that works everywhere.

Yes, just semantics. But I don't think I can agree that because you could have ensured this works portably people actually did. That's not been my experience.

Yesterday I watched that "Sea of Thieves" C++ 14 to C++ 20 upgrade story on Youtube, that feels much more like what I've seen - code that shouldn't have worked but it did, kept alive by people whose priority is a working game.

__builtin_popcount is not platform specific.
OK, sure, vendor specific then. C23 does not promise this incantation, it's presumably a GCCism.
I checked Godbolt, with RISC-V instead of ARM since I'm more familiar with that, and it doesn't look like it.

https://gcc.godbolt.org/z/b5s4WjnTG

(amomax is the atomic fetch-max instruction. lr and sc are load-reserved and store-conditional instructions; sc is like a regular store except it only succeeds if the address was not modified since the previous lr that accessed it. IOW the assembly is basically one-to-one with the C source.)