Hacker News
by Glacia 815 days ago
It's funny reading the comments here. Hacker News bros really think they're smarter than the guys who made ffmpeg/x264/x265 etc. Dunning–Kruger effect in action.

First you write in C. Then it's too slow, you write parts in asm. After a while you have a lot of asm. You lean harder on the macro processor. At some point you port to another processor and decide the macro processor can handle that. Bang, x86inc.asm.

That doesn't make the end point optimal. Nor does it mean it's what the authors would have done from a clean slate. At each step you take the sensible choice and after a long trek down the gradient you end up somewhere like this.

Given a desire to write something analogous to these codecs today, should you copy their development path? Should you try to copy the end result, maybe even using the same tools?

Your argument from authority amounts to "these guys are clever, you should imitate them". There are failure modes in that line of thinking which I hope the above makes clear.

We aren't stupid. We designed x86inc to be like that for good reasons, from a clean slate, and if we didn't like it we would have done something else.

You haven't tested the alternatives - they're slow and don't work in this situation, mostly because C is not actually that low level when it comes to memory aliasing.
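
For anyone wondering what the aliasing point looks like in practice, here is a minimal sketch of one common reading of it (my assumption, not something taken from x86inc or FFmpeg; all function names are hypothetical). Without `restrict`, the compiler has to assume a store through `dst` may change `*scale`, so it cannot keep the scale factor in a register across the loop. The demo shows the overlap really is observable, which is why the compiler is not allowed to ignore it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* No restrict: dst may overlap *scale, so *scale must be
 * re-read after every store to dst[i]. */
void scale_may_alias(int32_t *dst, const int32_t *scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] *= *scale;
}

/* restrict is the programmer's promise that dst and scale never overlap,
 * which lets the compiler hoist the load of *scale out of the loop. */
void scale_no_alias(int32_t *restrict dst, const int32_t *restrict scale,
                    size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] *= *scale;
}

/* Returns 1 if overlapping arguments change the result (hypothetical demo). */
int aliasing_changes_result(void)
{
    int32_t a[4] = {2, 3, 4, 5};
    int32_t b[4] = {2, 3, 4, 5};
    int32_t s = 2;

    /* scale points at a[0]: the first store turns the factor 2 into 4. */
    scale_may_alias(a, &a[0], 4);

    /* Separate storage for the factor: every element is multiplied by 2. */
    scale_no_alias(b, &s, 4);
    assert(b[0] == 4);

    return a[1] != b[1];
}
```

Calling `scale_no_alias` with overlapping pointers would be undefined behaviour, so the promise has to be kept by hand. That is the kind of fact an assembly author simply knows about their buffers.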

Well that's much more interesting. Is there anything written publicly about the experience? Any tooling used to help get the implementation right beyond testing and writing it carefully?

I've found a little here https://ffmpeg.org/developer.html#SIMD_002fDSP-1

The context is I'm a compiler developer who really liked working side by side with old school assembly developers in a past role. I'm painfully aware that the tribal knowledge of building stuff out of asm is hard to find written down and always curious about the directions in which things like C can be extended to narrow the gap.

FWIW I have been writing SIMD for 20+ years and worked on JPEG XL, which also contains a good bit of vector code.

BTW one anecdote: a colleague mentioned what should have been a quick 20 min patch to ffmpeg took a day because it was written in assembly.

That "20 minute patch" will need to be maintained for decades to come in FFmpeg, long after a standalone JPEG-XL library. Potentially centuries as archives like the Library of Congress are storing FFmpeg. So that's why it's done in assembly, so it's maintainable with the rest of the code.
Here is the old archived blog post where the x264 team answered in the comments why they do it that way. https://web.archive.org/web/20091223024333/http://x264dev.mu...
This is dated 2009. Probably sound advice back then.

Compilers are much much better with SIMD code than they were then. Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code (edit: when given SIMD code as input, see comment below).

I happen to know because this "hacker news bro" has been dealing with SIMD code for longer than that.

FFmpeg code by definition is not "basic SIMD code". And it supports numerous compilers other than LLVM.
> Today you'll have to work real hard to beat LLVM in optimizing basic SIMD code.

On dav1d, we see just an 800% increase… I know it’s negligible, but…

Compared to what? Scalar loopy C code sure. The auto vectorization is not great.

But give LLVM some SIMD code as input, and it will be able to optimize it, and it does a great job with register allocation, spill code, instruction scheduling etc.

Instruction selection isn't as great and you still need to use intrinsics for specialized instructions.

And you get all of this for all CPU architectures, and it will deal with future microarchitecture changes for free. E.g. more execution ports added by Intel will get used with no code changes on your side.

With infinite time you can still do better by hand, but it gets expensive fast, especially if you have several CPU architectures to deal with.
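
To make "give the compiler some SIMD code as input" concrete, here is a minimal sketch. It uses the GCC/Clang vector extensions rather than per-ISA intrinsics, purely so it builds on more than one architecture. That choice, the `v8i16` typedef and the function name `scale_add_i16` are assumptions for illustration, and nothing here is taken from FFmpeg or dav1d. The programmer fixes the vector width and the operations; register allocation, spills and scheduling are left to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eight 16-bit lanes in one 128-bit vector (GCC/Clang extension). */
typedef int16_t v8i16 __attribute__((vector_size(16)));

/* dst[i] = a[i] * k + b[i]; inputs are assumed small enough not to overflow. */
void scale_add_i16(int16_t *dst, const int16_t *a, const int16_t *b,
                   int16_t k, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    const v8i16 vk = {k, k, k, k, k, k, k, k};
    size_t i = 0;

    /* Explicit 8-lane body: memcpy expresses unaligned loads and stores. */
    for (; i + 8 <= n; i += 8) {
        v8i16 va, vb, vr;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = va * vk + vb;
        memcpy(dst + i, &vr, sizeof vr);
    }

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        dst[i] = (int16_t)(a[i] * k + b[i]);
}
```

Whether the generated code matches hand-written assembly is exactly what is being argued in this thread; the sketch only shows the shape of the input.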

Blog author (and dav1d/ffmpeg dev) here. My talk at VDD 2023 (https://www.youtube.com/watch?v=Z4DS3jiZhfo&t=9290s) did a comparison like the ones asked above. I compared an intrinsics implementation of the AV1 inverse transform with the hand-written assembly one found in dav1d. I analyzed the compiler-generated version (from intrinsics) versus the hand-written one in terms of instruction count (and cycle runtime, too) for different types of things a compiler does (data loads, stack spills, constant loading, actual multiply/add math, addition of result to predictor, etc.). Conclusion: modern compilers still can't do what we can do by hand, the difference is up to 2x - this is a huge difference. It's partially because compilers are not as clever as everyone likes to think, but also because humans are more clever and can choose to violate ABI rules if it's helpful, which a compiler cannot do. Is this hard? Yes. But at some scale, this is worth it. dav1d/FFmpeg are examples of such scale.
I am happily using compilers for just register allocation, spills, and scheduling in these use cases, but my impression is that compiler authors don't really consider compilers to be especially good at inputs like this where the user has already chosen the instructions and just wants register allocation, spills, and scheduling. The problem is known to be hard and the solutions we have are mostly heuristics tuned for very small numbers of live variables, which is the opposite of what you have if you want to do anything while you wait for your multiplies to be done.

Instruction selection, as you said, is mostly absent. Compilers will not substitute OR for blend, or shift for shuffle, even in cases where they are trivially equivalent, so the programmer has to know what execution ports are available anyway =/
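
A small sketch of the OR-for-blend case, assuming GCC/Clang vector extensions as a portable stand-in for real blend and OR instructions (the typedef and function names are hypothetical). When each input is already zero in the lanes the other one supplies, a plain OR gives the same result as the masked select. Which of the two is cheaper depends on the target's execution ports, and this code does not measure that.

```c
#include <assert.h>
#include <stdint.h>

/* Four 32-bit lanes in one 128-bit vector (GCC/Clang extension). */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Generic lane select: a where the mask lane is all-ones, b elsewhere. */
v4u32 blend(v4u32 a, v4u32 b, v4u32 mask)
{
    return (a & mask) | (b & ~mask);
}

/* Valid only when a is zero wherever the mask is zero and b is zero
 * wherever the mask is all-ones, i.e. the inputs occupy disjoint lanes. */
v4u32 blend_via_or(v4u32 a, v4u32 b)
{
    return a | b;
}

/* Checks lane by lane that both forms agree on disjoint inputs. */
void demo_or_vs_blend(void)
{
    const v4u32 a    = {1, 0, 3, 0};
    const v4u32 b    = {0, 20, 0, 40};
    const v4u32 mask = {~0u, 0, ~0u, 0};
    const v4u32 x = blend(a, b, mask);
    const v4u32 y = blend_via_or(a, b);

    for (int i = 0; i < 4; i++)
        assert(x[i] == y[i]);
}
```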

> Scalar loopy C code sure. The auto vectorization is not great.

Stop considering people as idiots.

People do that because it’s a LOT faster, not just a bit.

If you are so able, please show us your results. dav1d is fully open source, fully documented, and has quite simple C code.

Show your results.