Hacker News new | ask | show | jobs
by teux 1295 days ago
I often hand write neon (and other vectorised architecture) intrinsics/assembly for my job, optimising image and signal processing routines. We have seen many many 3 digit percentage speedups from bare c/c++ code.

I got into the nastiest discussion on reddit where people were swearing up and down it was impossible to beat the compiler, and handwritten assembly was useless/pretentious/dangerous. I was downvoted massively. Sigh.

Anyways, that was a year ago. Thanks for another point of validation for that. It clearly didn’t hurt my feelings. :)

I never come across people in the wild that actually do this also, it’s such a niche area of expertise.

8 comments

It also slightly annoys me a bit the things JIT people write on their github READMEs about the incredibly theoretical improvements that can happen at runtime, yet it's never anywhere close to AOT compilation. Then you can add 2-3x on top of that for hand-written assembly.

I do wonder whats going on with projects like BOLT though. I have seen it was merged into LLVM, and I have tried to use it but the improvement was never more than 7%. I feel like it has a lot of potential because it does try to take run-time into account.

See: https://github.com/llvm/llvm-project/tree/main/bolt

> improvement was never more than 7%.

If your use case isn't straining icache then you won't benefit as much.

BTW 7% is huge, odd that you would describe it as "only".

> BTW 7% is huge, odd that you would describe it as "only".

It depends on what you're doing and how optimized the baseline performance is. In my area (CRDTs) the baseline performance is terrible for a lot of these algorithms. Over about 18 months of work I've managed to improve on automerge's 2021 performance of ~5 minutes / 800MB of ram for one specific benchmark down to 4ms / 2MB. Thats 75000x faster. (Yjs in comparison takes ~1 second.)

Almost all of the performance improvement came from using more appropriate data structures and optimizing the fast-path. I couldn't find an off-the-shelf b-tree or skip list which did what I needed here. I ended up hand coding a b-tree for sequence data which run-length encodes items internally, and knows how to split and merge nodes when inserts happen. CRDTs also have a lot of fiddly computations when concurrent changes edit the same data, but users don't do that much in practice. Coding optimized fast paths for the 99% case got me another 10x performance improvement or so.

I'd take another 7% performance improvement on top of where I am, but this code is probably fast enough. I hear you that 7% is huge sometimes, and a smarter compiler is a better compiler. But 7% is a drop in the bucket for my work.

7% is huge in the context of compilers, which optimize general-purpose code.
7% is in the ballpark of the speedup most programs get from changing the allocator to not give almost every allocation with the same huge alignment and around half the speedup most programs get from using explicit huge pages. These changes are both a lot easier, but e.g. Microsoft doesn't think it's worthwhile to allow developers to make the latter change at all, over 26 years after the feature shipped in consumer Intel CPUs.
That's unfortunate. I wrote a VMM that tries to back memory with hugepages (even the guests page tables). It's making a difference!
> about the incredibly theoretical improvements that can happen at runtime

Which in the majority of cases can be achieved by profile guided optimization anyways.

It should be part of these discussions to proof what you claim. Always. With code samples, directly to the compiler and corresponding assembler.

https://godbolt.org/

Statistics are worthless alone, at the end all that counts is the arena of performance and what the code becomes and how it runs against the handcrafted version.

Godbolt doesn’t accurately show runtime speed of algorithms on input data, which is what you need when discussing simd performance. And often these are proprietary industry algorithms that are the core of a business’s model.

I’m all for transparency but I’m also not about to get fired for posting our kernel convolution routines, or least squares fit model.

> It should be part of these discussions to proof what you claim

Further - these aren’t subjective claims that need to be proven on a forum for legitimacy. It’s the literal state of vector based optimisations in the compiler world right now. It is a hard problem and for the time being humans are much better at it. This is quite a large area of academic research at the moment.

If someone is so uninformed of this domain that they don’t know this, the burden is on that person to learn what the industry is talking about. Not the people discussing the objective state of the industry.

Godbolt takes practice to read. Often people who are incapable of understaning when you can beat a compiler cannot also be shown a Godbolt snippet in good faith.
This is always deeply frustrating. You quickly get the sense that the person you're talking to hasn't experienced anything beyond simple float loops that are trivial for the compiler to autovectorize, or really bad examples of hand vectorization.

In the meantime, I constantly encounter algorithms that compilers fail to vectorize because even single vector instructions are too complex for the compiler to match, such as saturating integer adds. The compiler fails to autovectorize and the difference in performance is >5x. Even just something simple like adding up unsigned bytes, and all three major compilers generate vector code that's much slower than a simple loop leveraging absolute difference instructions.

That's even before running into the more complex operations that would require the compiler to match half a dozen lines of code, like ARM's Signed Saturating Rounding Doubling Multiply Accumulate returning High Half: https://developer.arm.com/architectures/instruction-sets/int...

Or cases where the compiler is not _allowed_ to apply vector optimizations by itself, because changes to data structures are required.

> it’s such a niche area of expertise.

It is that! As I stated in another comment, this niche ended up saving the company literally millions of dollars in hardware costs. To be fair, it really should only be done in highly performance-critical situations and only after an initial implementation is already written in pure C++ with thorough unit testing for _all_ input/output cases.

What does the company you work for do?
The company I wrote the hand-optimized code does DNA analysis. I worked there mostly because I needed a job. DNA analysis certainly isn't within my passion domain.

Now I work for a company to writing software to control drones. It doesn't pay as much as I want, but it's at least fun and there's a ton of unsolved automation problems at the company. And we're hiring like crazy -- from 6 people to 300+ in the past two years, and there's no sign of slowing down. And I'm now a manager too... that's the really scary part lol

> I got into the nastiest discussion on reddit where people were swearing up and down it was impossible to beat the compiler, and handwritten assembly was useless/pretentious/dangerous.

It _should_ be useless (for some reasonable definition of "should") — it just isn't in practice. And I'm continually amazed at how often people confuse one for the other, across all contexts. E.g. I have family members who refuse to consider that our justice system might have deep flaws, because to their mind if it should be some other way then it already would be.

Isn't compiler optimization np complete? I don't think I'd put anything there in "should". Yeah any single optimization (or permutation thereof) can be applied, but they're order-dependent and the combinatorial explosion means you can't try to apply all of them.
What makes you think that humans are better at solving np complete problems?
Not in general. But humans can exploit prior knowledge to select which avenues to pursue first.
Tell them to read the ffmpeg code. All the platform-specific/SIMD stuff is done in asm.

This isn't only because it's faster, it's honestly easier to read than intrinsics anyway. What it does lack is debugability.

Or any other highly optimised numerical codebase. From a quick glance at OpenBLAS, it looks like they have a lot of microarchitecture-specific assembly code, with dispatching code to pick out the appropriate implementations.

https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...

https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...

For debugging you can actually use gdb in assembly tui mode and step through the instructions! You can even get it hooked up in vs code and remote debug an embedded target using the full IDE. Full register view, watch registers for changes, breakpoints, step instruction to instruction.

Pipelining and optimisations can make the intrinsics a bit fucky though, have to make sure it’s -O0 and a proper debug compilation.

I have line by line debugged raw assembly many times. It’s just a pain to initially set up. Honestly not very different from c/c++ debugging once running.

Sure, but gdb doesn't know what the function parameters are, or on some platforms where functions start and end, crashes don't have source lines, and ASan doesn't work. (though of course valgrind does)
If you are handwriting the function in assembly, you'll know what registers hold the function parameters, what types of values they are supposed to be, and with care, you can produce debug information and CFI directives to allow for stack unwinding, it's just annoying to do - but that's just the tradeoff you make for the performance improvement I suppose.
I don’t know if this is frowned upon or not among assembly programmers, but I often just use naked functions in C with asm bodies, which gdb will provide the args for, rather than linking against a separate assembly file.
If you write your assembly to look like C code GDB is more than happy to provide you with much of that to the extent that it can. In particular, it will identify functions and source mappings from debug symbols.
Pipelining and optimizations…when reading in the debugger? I don't quite understand how this is relevant.
ffmpeg might have amazingly efficient inner loops (i.e. low-level decoding/encoding), but the broader architecture (e.g. memory buffer implementations, etc) is quite inefficient. Like the low-level media code it's not that each component itself is inefficient, it's that the interfaces and control flow semantics between them obstruct both compiler and architectural optimizations.

When I wrote a transcoding multimedia server I ended up writing my own framework and simply pulling in the low-level decoders/encoders, most of which are maintained as separate libraries. I ended up being able to push at least an order of magnitude more streams through the server than if I had used ffmpeg (more specifically, libavcodec) itself, even though I still effectively ended up with an abstraction layer intermediating encoder and format types. And I never wrote a single line of assembly.

There's no secret sauce to optimization: it's not about using assembly, fancier data structures, etc; it's learning to identify impedance mismatches, and those exist up and down the stack. Sometimes a "dumber" data structure or algorithm can create opportunities (for the developer, for the compiler) for more harmonious data and code flow. And impedance mismatches sometimes exist beyond the code--e.g. mismatch between functionality and technical capabilities, where your best might be to redefine the problem, which can often be done without significantly changing how users experience the end product.

> most of which are maintained as separate libraries

This is so confusing I can’t tell if you’re actually talking about libavcodec. The whole point is to combine codecs to share common code, “most” decoders certainly aren’t available elsewhere.

If you just want to call libx264 directly go ahead and do that of course. libx264 uses assembly just as much or more than libavcodec though.

I have a lot of sympathy for wanting efficient code. But let's indeed have a look: https://github.com/FFmpeg/FFmpeg/blob/7bbad32d5ab69cb52bc92a... There are so many macros, %if and clutter here that it's difficult (for me?) to keep the big picture in mind.

This reminds me of a retrospective of an OS/window manager written in assembly - they were great about avoiding tiny overheads, but expressed regret that the whole system ended up slow because it was hard to reason about bigger things such as how often to redraw everything, similar to what people are saying here.

To be clear: let's indeed optimize and vectorize, but better to build on intrinsics than go all the way down to assembly.

I prefer intrinsics over assembly.

There're too many different assemblies: inline, MASM, NASM, FASM, YASM. They come with their unique quirks, and they complicate build.

Intrinsics are more portable. It's trivial to re-compile legacy SSE intrinsics into AVX1. You won't automatically get 32-byte vectors this way, but you will get VEX encoding, broadcasts for _mm_set1_something, and more.

Readability depends on the code style. When you write intrinsics using "assembly with types" style, actual assembly is indeed more readable. OTOH, with C++ it's possible to make intrinsics way better than assembly: arithmetic operators instead of vaddpd/vsubpd/vmulpd/vdivpd, strongly-typed classes wrapping low-level vectors for specific use cases, etc.

Update: most real-life functions contain scalar code (like loops), also auto-generated code (stack frame setup, back up / restore of non-volatile registers). When coding non-inline assembly, developer needs to do that manually in assembly, this can be hard to do, and may cause bugs like these https://github.com/openssl/openssl/issues/12328 https://news.ycombinator.com/item?id=33705209

FFmpeg code is god-awful. A lot of it is like from 2002 and written without regards to any sort of "sanity". People who write assembly routines these days have a structure to their code, and if they overrun buffers or whatever they'll document what alignment assumptions they're making. FFmpeg will just start patching its own code at runtime because someone thought it was a good idea on Pentium processors.
Nice!

> I never come across people in the wild that actually do this also, it’s such a niche area of expertise.

I'd guesstimate there are several hundred of us working on vectorization :)

I would guess you're off by an order of magnitude or two.
Interesting. Any thoughts on where they can be found?
A lot of them don’t have accounts on the internet. Many do their job part-time as the need arises.
> impossible to beat the compiler

Ludicrous! How could they be taken seriously? Which subreddit was this?

Aren’t all speedups 3 digit percentage? As in 100% or more? Did you mean 100x?
I think when people measure speedups here they deduct the first 100%, e.g. if I used to be able to process 10 items per second and can now process 20 items per second you can run 200% as fast but it's a 100% speedup.
To avoid the various ambiguities here, I learned to express speedups as a ratio of the optimized throughput, divided by the baseline throughput. Or equivalently: the baseline time divided by optimized time.

4400 MB/s vs 800 MB/s = 5.5x speedup or 5.5 times as fast.

A three digit percentage means at least a quarter of the time, but possibly more.

Something that is 50% faster will take half the time. Something that is 100% faster will take a quarter, and so on.

Something that is 50% faster takes 2/3 as long sir.