| I have some experience with this, i.e. getting LLVM to optimize and generate the "best" code! I have been working on target-independent "kernels" for the rav1e AV1 encoder and have had to do a lot of unidiomatic things to get LLVM to generate machine code similar in quality to hand-written assembly. Granted, this is on integers and not floats, but the same principles should apply. What I've found is that you need to ignore most of Rust: use/load raw pointers, don't use slices, unroll manually, vectorize manually, and check preconditions manually. You'll still get the amazing type system, but the code will have to be more C-like than Rust-like. * raw pointers: LLVM is pretty good at optimizing C code. Rust-specific optimization still needs some work. (edit: I assumed arrays here, so you'll need the pointer for offsets; references are still okay. You'd also use the pointers for iterating instead of the usual slice iteration patterns) * no slices: index checking is expensive. Not to the CPU, which rarely mispredicts the check branches, but to the optimizer. I've found these checks are mostly not elided, even after inlining. * no slices: slice indexing uses overflow checking. For rav1e's case, the block/plane sizes mean that the index calculation will never overflow a `u32`, so calculating the offsets using `u32` is fine (I'll have to switch to using a pseudo u24 integer for GPUs though, because `u32` is still expensive on them). * unroll manually: LLVM would probably do more of this with profiling info, but I've never found it (this is subjective!) to do any unrolling without it. Maybe if all the other items here are also done... * vectorize manually: Similar to unrolling. I've observed only limited automatic vectorization. * And to get safety back: check, check, and check before calling the fast kernel! I.e., wrap the kernel function with one that does all the checks elided in the kernel. 
Source: I wrote https://github.com/xiph/rav1e/pull/1716, which speeds up the non-asm encodes by over 2x! |
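
To make the "checked wrapper around an unchecked kernel" pattern from the list above concrete, here is a minimal sketch. It is not code from rav1e or from the PR: the names `sad` and `sad_kernel`, the choice of a sum-of-absolute-differences over 8-bit pixels, and the "width is a multiple of 4" precondition are all assumptions made up for illustration. The kernel uses raw pointers, `u32` offsets, and a manual 4x unroll; the safe wrapper does every check once, up front.

```rust
/// Hypothetical kernel (illustration only, not rav1e code): sum of absolute
/// differences over a `w` x `h` block of 8-bit pixels.
///
/// # Safety
/// The caller must guarantee that `src` and `dst` are valid for reads of
/// `(h - 1) * stride + w` bytes, that `w` is a non-zero multiple of 4, that
/// `h` is non-zero, and that all offsets and the result fit in a `u32`.
#[inline(always)]
unsafe fn sad_kernel(
    src: *const u8,
    src_stride: u32,
    dst: *const u8,
    dst_stride: u32,
    w: u32,
    h: u32,
) -> u32 {
    let mut sum = 0u32;
    let mut y = 0u32;
    while y < h {
        // Row offsets are computed in u32; the wrapper proved they fit.
        let s = unsafe { src.add((y * src_stride) as usize) };
        let d = unsafe { dst.add((y * dst_stride) as usize) };
        let mut x = 0u32;
        while x < w {
            let i = x as usize;
            // Manually unrolled by 4: no bounds checks, no slice indexing.
            unsafe {
                sum += (*s.add(i)).abs_diff(*d.add(i)) as u32;
                sum += (*s.add(i + 1)).abs_diff(*d.add(i + 1)) as u32;
                sum += (*s.add(i + 2)).abs_diff(*d.add(i + 2)) as u32;
                sum += (*s.add(i + 3)).abs_diff(*d.add(i + 3)) as u32;
            }
            x += 4;
        }
        y += 1;
    }
    sum
}

/// Safe wrapper: performs all the checks the kernel skips, then calls it.
/// Returns `None` if any precondition fails.
pub fn sad(
    src: &[u8],
    src_stride: usize,
    dst: &[u8],
    dst_stride: usize,
    w: usize,
    h: usize,
) -> Option<u32> {
    // Shape preconditions assumed by the unrolled loop.
    if w == 0 || h == 0 || w % 4 != 0 || w > src_stride || w > dst_stride {
        return None;
    }
    // Number of bytes the kernel will touch in each buffer.
    let src_needed = (h - 1).checked_mul(src_stride)?.checked_add(w)?;
    let dst_needed = (h - 1).checked_mul(dst_stride)?.checked_add(w)?;
    if src_needed > src.len() || dst_needed > dst.len() {
        return None;
    }
    // Offsets and the worst-case sum (255 per pixel) must fit in u32.
    let max = u32::MAX as usize;
    if src_needed > max || dst_needed > max {
        return None;
    }
    if w.checked_mul(h)?.checked_mul(255)? > max {
        return None;
    }
    // SAFETY: every precondition listed on `sad_kernel` was checked above.
    Some(unsafe {
        sad_kernel(
            src.as_ptr(),
            src_stride as u32,
            dst.as_ptr(),
            dst_stride as u32,
            w as u32,
            h as u32,
        )
    })
}
```

Callers only ever see `sad`, so the unsafety stays contained in one place; for example, `sad(&src, 4, &dst, 4, 4, 2)` on two 4x2 blocks returns `Some(total)`, and a buffer that is one byte too short returns `None` instead of reading out of bounds. Whether this actually beats the slice-based version depends on the target and compiler version, so it is worth comparing the generated assembly.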