| I often hand-write NEON (and other vectorised architecture) intrinsics/assembly for my job, optimising image and signal processing routines. We have seen many, many three-digit percentage speedups over plain C/C++ code. I got into the nastiest discussion on Reddit, where people were swearing up and down that it was impossible to beat the compiler and that handwritten assembly was useless/pretentious/dangerous. I was downvoted massively. Sigh. Anyway, that was a year ago. Thanks for another point of validation. It clearly didn't hurt my feelings. :) I also never come across people in the wild who actually do this; it's such a niche area of expertise. |
I do wonder what's going on with projects like BOLT, though. I saw that it was merged into LLVM, and I have tried to use it, but the improvement was never more than 7%. I feel like it has a lot of potential because it takes run-time profile data into account.
See: https://github.com/llvm/llvm-project/tree/main/bolt