Hacker News
by throw0101b 1191 days ago
> Various assembler optimisations to a number of different algorithms (e.g. AES-GCM, ChaCha20, SM3, SM4, SM4-GCM) across multiple processor architectures

With modern compilers, how often (or in what circumstances) is it worth "hand-rolling" assembler code versus just letting the compiler do it? Does one make the assembler 'from scratch', or perhaps let the compiler generate the assembler and have a human look at it to see if there are any places it can be improved?

5 comments

I think cryptography is one of the few places where it makes sense to do that, because:

* There's not that much code involved.

* Many CPUs have instructions specifically made for accelerating cryptographic algorithms.

* Security may impose specific requirements on the code, such as not leaking any secrets through timing. This may require intentionally writing very specific, suboptimal code.
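To make the timing point concrete, here is a minimal sketch of a comparison that does not exit early on the first mismatch. The name `ct_equal` is made up for illustration. Note the assumption baked in: C itself gives no guarantee that the compiler keeps this branch-free, which is part of why people end up writing or at least auditing the assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the buffers match, 0 otherwise.
 * The loop always visits all len bytes, so the running time does not
 * depend on the position of the first differing byte. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;

    for (size_t i = 0; i < len; i++) {
        /* Accumulate every differing bit instead of returning early. */
        diff |= (uint8_t)(a[i] ^ b[i]);
    }

    /* Map diff == 0 to 1 and any other value to 0 without branching:
     * 0 - 1 wraps to 0xFFFFFFFF, whose bit 8 is set; for 1..255 the
     * result of diff - 1 fits in 8 bits, so bit 8 is clear. */
    return (int)((((uint32_t)diff - 1u) >> 8) & 1u);
}
```

A plain `memcmp` is usually faster on average precisely because it stops at the first difference, and that early exit is the timing leak.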

... and keeping critical pieces of code as independent as possible from the very few grotesquely and absurdly massive and complex optimizing compilers is always a good idea.

It's very worth doing in this context ... almost all of the assembly I've written in the last ten years has been on routines like this. Compilers are very smart, but it's hard for them to optimize concurrent and interleaved cryptographic algorithms to be cache-efficient, pipeline-efficient, and operation-efficient at the same time.

AES-GCM is "AES" and "GCM" running at the same time on the same data. ChaCha20-Poly1305 is "ChaCha20" and "Poly1305" running at the same time on the same data, usually block by block so that you avoid pulling data into cache more than once. You can interleave their imperative operations in C or Rust code (or whatever) ... but the compiler isn't going to intuit how some of the math can be re-used across the algorithms without a lot of hints, or how it can be safely vectorized, and at that point you might as well just write the assembly.
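A rough sketch of the "one pass over the data" idea in C. The cipher and the tag here are toy stand-ins (an XOR keystream and a multiply-and-add checksum), not AES, GCM, ChaCha20 or Poly1305, and all the names are invented for illustration. It only shows the shape of the interleaving: each byte is encrypted and folded into the tag while it is still in a register, instead of being read back in a second loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy keystream byte for position i. NOT real cryptography. */
static uint8_t toy_keystream(uint8_t key, size_t i)
{
    return (uint8_t)(key + 31u * i);
}

/* Two passes: encrypt the whole buffer, then walk it again for the tag. */
uint32_t two_pass(uint8_t *buf, size_t len, uint8_t key)
{
    uint32_t tag = 0;

    for (size_t i = 0; i < len; i++) {
        buf[i] ^= toy_keystream(key, i);
    }
    for (size_t i = 0; i < len; i++) {
        tag = tag * 33u + buf[i];
    }
    return tag;
}

/* One pass: each ciphertext byte feeds the tag right after it is produced. */
uint32_t fused(uint8_t *buf, size_t len, uint8_t key)
{
    uint32_t tag = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t c = (uint8_t)(buf[i] ^ toy_keystream(key, i));
        buf[i] = c;
        tag = tag * 33u + c;
    }
    return tag;
}
```

Both functions produce the same ciphertext and the same tag. The real implementations go a lot further than this (sharing intermediate values, scheduling the instructions of both algorithms around each other), which is the part a compiler will not work out on its own.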

If you look at the output of your compiler, you will see many unnecessary loads/stores. Vectorized code in particular still comes out lacking, even with intrinsics.

In fact, you can benchmark openssl's assembly vs openssl's C: https://github.com/openssl/openssl/blob/master/crypto/aes/ae...

Granted, they aren't using intrinsics in that code, but a sufficiently smart compiler shouldn't need intrinsics.

Compilers are capable of very effective optimizations, but they need certain guarantees to be able to apply them and sometimes it's a pain to communicate those guarantees adequately in your source code, or your platform targets don't support all the hints you might need to apply.
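One example of such a guarantee in C is `restrict` (C99), sketched below with a made-up function `xor_words`. Without it, the compiler has to assume `dst` might overlap `src` or `key`, which typically means either reloading values after every store or emitting a runtime overlap check before it can vectorize the loop. Whether a given compiler actually does better with the hint is something you would have to check in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = src[i] ^ key[i] for n 32-bit words.
 * restrict is a promise from the caller that the three arrays do not
 * overlap; breaking that promise is undefined behaviour. */
void xor_words(uint32_t *restrict dst,
               const uint32_t *restrict src,
               const uint32_t *restrict key,
               size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] ^ key[i];
    }
}
```

The catch is the one mentioned above: the promise lives in your source, the compiler trusts it blindly, and not every language or toolchain gives you an equivalent way to say it.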
Most of the time, a human will do worse than the compiler. But a human who knows what they're doing and understands the problem well can still improve on the output.