Hacker News new | ask | show | jobs
by diamondlovesyou 2437 days ago
I have some experience with this, ie ensuring LLVM optimizes and codegens the "best"! I have been working to generate target independent "kernels" for the Rav1e AV1 encoder and have had to do a lot of unidiomatic things to get LLVM to generate machine code similar in quality to hand written ASM. Granted, this is on integers and not floats, but the same principles should apply.

What I've found is that you need to ignore most of Rust: use/load raw pointers, don't use slices, unroll manually, vectorize manually, and check preconditions manually. You'll still get the amazing type system, but the code will have to be more C-like than Rust-like.

* raw pointers: LLVM is pretty good at optimizing C code. Rust specific optimization needs some work. (edit: I assumed arrays here, so you'll need the pointer for offsets; references are still okay. You'd also use the pointers for iterating instead of the usual slice iteration patterns)

* no slices: index checking is expensive, not to the CPU, the CPU rarely misses the check branches, but to the optimizer. I've found these are mostly left un-elided, even after inlining.

* no slices: slice indexing uses overflow checking. For Rav1e's case, the block/plane sizes mean that doing the index calculation using `u32` will never overflow, so calculating the offsets using u32 is fine (I'll have to switch to using a pseudo u24 integer for GPUs though, because u32 is still expensive on them).

* unroll manually: LLVM would probably do more of this with profiling info, but I've never found it (this is subjective!) to do any unrolling w/o. Maybe if all the other items here are also done...

* vectorize manually: Similar to unrolling. I've observed only limited automatic vectorization.

* And to get safety back: check, check, and check before calling the fast kernel! Ie wrap the kernel function with one that does all the checks elided in the kernel.

Source: Wrote https://github.com/xiph/rav1e/pull/1716, which speeds up the non-asm encodes by over 2x!

6 comments

the first order reason Rust and C++ differ in the article is because Rust will not pass the ffastmath flag to llvm, not because of any of this stuff.
Be that as it may, I wasn't talking about C++. Also, the integer operation optimizations require no such flags on either side, and yet the resulting machine code was still very poor for Rust.
Not sure I understand the part about raw pointers. As far as I understand, Rust references will surely turn into pointers at the LLVM IR level?
They do. The issue may be that references don't support pointer arithmetic – that is, given `x: &u32` you cannot get `x[1]`. The normal approach is to use a reference to a slice, `x: &[u32]`, and the default bounds-checked indexing. Rust does let you do unchecked indexing on slices (with `unsafe`), but it may be more efficient to avoid indices entirely in favor of pointer arithmetic. LLVM can often optimize index calculations into pointer arithmetic, but not always.

Edit: Although, on rereading the above post, I see diamondlovesyou did mention indices, so... not really sure what's going on.

True. I should have explained that better. If you're accessing an array, you'll have to do the pointer offset-ing yourself (otherwise you're using a slice and all the checking that entails). Thus, you can't use a reference type, because reference types don't have `<*const T>::add` like pointers do (also, `&T as usize` is invalid; you have to go through a pointer type first). I suppose I assumed you'd be accessing an array/slice; references otherwise are fine.
Rust references should in general optimize better because they give stronger aliasing guarantees.

Even for slices, using get_unchecked(1..) to get a smaller subslice without bounds checking might be better than pointer arithmetic as long as the slice lengths get optimized away (i.e. they are never used and never passed to non-inlined functions).

> Rust references should in general optimize better because they give stronger aliasing guarantees.

AFAIK this does not work atm due to a codegen bug in llvm (which can also affect code using restrict in C in some cases). This bug will get fixed one day, but most likely another bug will be revealed at this point… This part of LLVM was never really used in C as much as in Rust, so they keep finding bugs in it. Hopefully it will get fine in the long run, but I'm not holding my breath.

You can just pass a reference/mut ref to the first element of the slice. This is actually how the generated kernels in my Rav1e PR do it, just for the aliasing reason you mentioned.
You can use iterators to avoid unnecessary bounds checking on every element in a slice and you can still get an index to the value. Something like this:

``` for (i, val) in my_slice.iter().enumerate() { let x = *val + 9999; } ```

Not really in this case; Rav1e runs over blocks of image planes (like say a 16x16 block of a specific 720p color channel), so there is no continuous slice to iterate over.
Maybe you can use chunks_exact() and chunks_exact_mut(). The _exact versions allow lifting the bounds checks out of the loop and gave me some great performance boosts in image processing code.
These blocks are non-continuous. It operates over a part of a row, where the start of the next row is the start of the current row + some stride.

I mean maybe? But I probably wouldn't anyway for this case as a slice reference is actually a "fat" pointer (ie twice the size of a normal pointer) and the length of the slice won't be used (the block size is known per kernel); LLVM might delete the length part anyway.

These are automatically generated kernels, so readability isn't the primary concern here.

Once const-generics are released, I feel like you should be able to create your own "fixed-size slice" type, which uses the constant parameter for "bounds checking". I imagine that would be a lot more optimiser-friendly...
To sum up: LLVM is heavily optimized for C code, not so much for Rust. So Rust code has to imitate certain C mannerisms for the optimizer to kick in.

You have to pay the price of compatibility with the existing toolchain even if it's not your explicit goal.

I'm not sure this is the optimizer's fault. If I'm doing a bounds-checked array access that might panic, that panic is an observable side effect that cannot be reordered with respect to other observable side effects. That puts constraints on what the optimizer can do, not because it's not smart enough, but because it has to respect the meaning of the program.
there are aspects of Rust that are easier for llvm to optimize with than c. there are things rust isn't yet doing that it will do to make llvm be able to optimize it better as well.

the primary reason THIS code didn't optimize as well is that you can't pass ffast-math to llvm from rust in any useful way (yet)

Can anyone elaborate a bit on why this PR has such terrible compile times? As someone with only a passing familiarity of Rust (I do most of my work in C/C++ and generally expect sub-10s compile times), I would naively expect that writing more C-like code which uses fewer of Rust's language features would compile faster or at least not considerably slower. What's going on there?
The biggest Rust language feature is the borrow checker IMO, and unless you're running in unsafe, that's still doing its job. I always assumed the slowness was more based on the borrow checker analysis than on the error wrapping (option) stuff, slices, or functional bits.
I don't have anything solid backing this up, but I've often heard Rust proponents claim that the borrow checker is actually fairly cheap in terms of compile time overhead. Looking at the PR, it's obvious that it generates quite a bit of code (separate kernels for different block sizes, etc.) and does things like generics over tuples, but I don't know enough about Rust to pick out patterns that might be particularly slow to compile. It does have somewhat fewer unsafe blocks than I expected.
https://wiki.alopex.li/WhereRustcSpendsItsTime is a recent exploration of this topic that's very interesting.
That was interesting indeed, thanks. For an outsider, the most trivial takeaway from that is something along the lines of "Rust is slow to compile and it's probably not getting 10x faster any time soon". Beyond the obvious culprits like heavy metaprogramming, it looks like death by a thousand cuts (not to mention that ggez spends 15 seconds on something that's not even accounted for in the profile, that alone is huge) and there obviously isn't a common root cause for huge compile times in general.
Sounds like it's fair to say most of this boils down to bounds checks and SIMD?
> index checking is expensive

Seems a little gung ho to disable guaranteed index checking in a video codec no? I know you still do the checks, but it sounds like it's not in a statically guaranteed way.

Surely tight loops in performance-sensitive code, are precisely where you'd consider disabling runtime checks, no?

You can still have them enabled for debugging, at least. (Something not generally possible in C/C++, sadly.)

Depends how much you care about safety, and how much safety is needed. If you're doing an FEA analysis or whatever, fine disable all checking. But a video codec is guaranteed to be used in dangerous environments.

Also depends how much it affects performance. It it doubles it, ok I'll accept the risk. 10% faster? No thanks I'd rather play it safe.