by jbk 815 days ago
> Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code.

On dav1d, we see just an 800% increase… I know it’s negligible, but…


Compared to what? Scalar, loopy C code, sure. Auto-vectorization is not great.

But give LLVM some SIMD code as input, and it will be able to optimize it, and it does a great job with register allocation, spill code, instruction scheduling etc.

Instruction selection isn't as great and you still need to use intrinsics for specialized instructions.

And you get all of this for all CPU architectures, and the compiler will deal with future microarchitecture changes for free. E.g. more execution ports added by Intel will get used with no code changes on your side.

With infinite time you can still do better by hand, but it gets expensive fast, especially if you have several CPU architectures to deal with.
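To make the scalar-versus-intrinsics distinction concrete, here is a minimal sketch, not taken from dav1d or any other project mentioned in this thread. It shows a saturating add of unsigned bytes written once as a plain loop and once with SSE2 intrinsics. The function names are hypothetical, and the code assumes an x86 or x86-64 target where SSE2 is available. Whether a given compiler turns the scalar loop into the same instruction on its own varies by compiler and version.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 or x86-64 target */

/* Scalar reference: add two byte arrays, clamping each sum to 255. */
void add_sat_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}

/* Intrinsics version: the programmer picks the instruction (paddusb via
   _mm_adds_epu8); register allocation and scheduling are left to the compiler. */
void add_sat_sse2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    /* Process 16 bytes per iteration with unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
    /* Scalar tail for the remaining 0..15 bytes. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Both functions should produce identical output for any input length; comparing the disassembly of the two under your own compiler and flags is the quickest way to see what it actually does with each.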

Blog author (and dav1d/ffmpeg dev) here. My talk at VDD 2023 (https://www.youtube.com/watch?v=Z4DS3jiZhfo&t=9290s) did a comparison like the ones asked above. I compared an intrinsics implementation of the AV1 inverse transform with the hand-written assembly one found in dav1d. I analyzed the compiler-generated version (from intrinsics) versus the hand-written one in terms of instruction count (and cycle runtime, too) for different types of things a compiler does (data loads, stack spills, constant loading, actual multiply/add math, addition of result to predictor, etc.). Conclusion: modern compilers still can't do what we can do by hand, the difference is up to 2x - this is a huge difference. It's partially because compilers are not as clever as everyone likes to think, but also because humans are more clever and can choose to violate ABI rules if it's helpful, which a compiler cannot do. Is this hard? Yes. But at some scale, this is worth it. dav1d/FFmpeg are examples of such scale.

I am happily using compilers for just register allocation, spills, and scheduling in these use cases, but my impression is that compiler authors don't really consider compilers to be especially good at inputs like this where the user has already chosen the instructions and just wants register allocation, spills, and scheduling. The problem is known to be hard and the solutions we have are mostly heuristics tuned for very small numbers of live variables, which is the opposite of what you have if you want to do anything while you wait for your multiplies to be done.

Instruction selection, as you said, is mostly absent. Compilers will not substitute an OR for a blend, or a shift for a shuffle, even in cases where they are trivially equivalent, so the programmer has to know what execution ports are available anyway =/
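As one illustration of a shift and a shuffle being interchangeable, here is a minimal sketch, written for this thread rather than taken from any real codebase. It multiplies the odd 32-bit lanes of two vectors. Because _mm_mul_epu32 only reads lanes 0 and 2, a 64-bit shift and a 32-bit shuffle are equally valid ways to bring lanes 1 and 3 into position. The helper names are hypothetical, and the code assumes an x86 or x86-64 target with SSE2. Which variant is cheaper depends on the microarchitecture and on what else in the loop competes for the same execution units.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics; assumes an x86 or x86-64 target */

/* Variant 1: a 64-bit logical shift moves lane 1 to lane 0 and lane 3 to
   lane 2, leaving zeros in lanes 1 and 3. */
static __m128i odd_to_even_shift(__m128i v) {
    return _mm_srli_epi64(v, 32);
}

/* Variant 2: a 32-bit shuffle copies lane 1 to lane 0 and lane 3 to lane 2,
   leaving copies (not zeros) in lanes 1 and 3. */
static __m128i odd_to_even_shuffle(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 1, 1));
}

/* Hypothetical helper: multiply the odd 32-bit lanes of a and b into two
   64-bit products. _mm_mul_epu32 ignores lanes 1 and 3 of its inputs, so the
   two variants give the same result here. */
void mul_odd_lanes(const uint32_t a[4], const uint32_t b[4],
                   uint64_t out[2], int use_shuffle) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i pa, pb;
    if (use_shuffle) {
        pa = odd_to_even_shuffle(va);
        pb = odd_to_even_shuffle(vb);
    } else {
        pa = odd_to_even_shift(va);
        pb = odd_to_even_shift(vb);
    }
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(pa, pb));
}
```

The two variants are only equivalent because of how the result is consumed, which is exactly the kind of context-dependent choice the comment above says the programmer ends up making by hand.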

> Scalar loopy C code sure. The auto vectorization is not great.

Stop considering people as idiots.

People do that because it’s a LOT faster, not just a bit.

If you are so able, please show us your results. dav1d is fully open source, fully documented, and with quite simple C code.

Show your results.

> show us your results

Not GP but here’s an example where intrinsics outperformed assembly by an order of magnitude: https://news.ycombinator.com/item?id=36624240

They were AVX2 SIMD intrinsics versus scalar assembly, but I doubt AVX2 assembly is going to substantially improve the performance of my C++. The compiler did a decent job allocating these vector registers and the assembly code is not too bad, not much to improve.

It’s interesting how close your 800% is to my 1000%. For this reason, I have a suspicion you tested the opposite, naïve C or C++ versus SIMD assembly. Or maybe you have tested automatically vectorized C or C++ code; automatic vectorizers often fail to deliver anything good.

So you took asm code that had no SIMD instructions in it, made your own version in C++ with intrinsics and figured out that, yes, SIMD is faster? Really?

I think you're completely missing what we are talking about here.

> Really?

No, not really. My point is, in modern compilers SSE and AVX intrinsics are usually pretty good, and assembly is not needed anymore even for very performance-sensitive use cases like video codecs or numerical HPC algorithms.

I think in the modern world it’s sufficient for developers to be able to read assembly, to understand what compilers are doing to their code. However, writing assembly is not the best idea anymore.

Assembly is unreliable due to OS-specific shenanigans, resulting in bugs like this one: https://issues.chromium.org/issues/40185629

Assembly complicates builds because inline assembly is not available in all compilers, and for non-inline assembly every project uses a different assembler: YASM, NASM, MASM, etc.

> and assembly is not needed anymore even for very performance-sensitive use cases like video codecs

People in this thread who have been writing video codecs for years, codecs that you use daily, are telling you that, no, it’s a lot faster (10-20%), but you, who have done none of those, know better…