Or any other highly optimised numerical codebase. From a quick glance at OpenBLAS, it looks like they have a lot of microarchitecture-specific assembly code, with dispatching code to pick out the appropriate implementations.
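For anyone who hasn't seen the dispatch pattern, here's a rough sketch of the idea. It is not OpenBLAS's actual code: `SumFn`, `sum_generic`, `sum_unrolled`, `pick_sum` and `sum` are all made-up names, and the "tuned" kernel is plain C++ standing in for what would really be hand-written assembly. The only real API used is GCC/Clang's `__builtin_cpu_init` / `__builtin_cpu_supports` on x86; on other targets the sketch just falls back to the generic path.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel signature (not an OpenBLAS API): sum of n floats.
using SumFn = float (*)(const float*, std::size_t);

// Portable baseline that works on any CPU.
float sum_generic(const float* x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i];
    return acc;
}

// Stand-in for a microarchitecture-specific kernel. A real library would
// put hand-written assembly or intrinsics here; this is plain C++ with
// four independent accumulators so the sketch builds everywhere.
float sum_unrolled(const float* x, std::size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; ++i) a0 += x[i];  // leftover elements
    return (a0 + a1) + (a2 + a3);
}

// Query the CPU once and choose an implementation.
SumFn pick_sum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_unrolled;
#endif
    return sum_generic;
}

// Public entry point: the choice is made on first call and then cached.
float sum(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    static const SumFn impl = pick_sum();
    return impl(x, n);
}
```

Callers only ever see `sum`; the function pointer hides which kernel ran, which is roughly why a library can carry many per-CPU implementations without the API changing.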
For debugging, you can actually use gdb in its assembly TUI mode and step through the instructions! You can even hook it up to VS Code and remote-debug an embedded target from the full IDE: full register view, watching registers for changes, breakpoints, stepping instruction by instruction.
Pipelining and optimisations can make the intrinsics a bit fucky though; you have to make sure it's built with -O0 as a proper debug compilation.
I have line by line debugged raw assembly many times. It’s just a pain to initially set up. Honestly not very different from c/c++ debugging once running.
Sure, but gdb doesn't know what the function parameters are or, on some platforms, where functions start and end; crashes don't have source lines; and ASan doesn't work (though of course valgrind does).
If you are handwriting the function in assembly, you'll know which registers hold the function parameters and what types of values they are supposed to be, and with care you can produce debug information and CFI directives to allow for stack unwinding. It's just annoying to do, but that's the tradeoff you make for the performance improvement, I suppose.
I don’t know if this is frowned upon or not among assembly programmers, but I often just use naked functions in C with asm bodies, which gdb will provide the args for, rather than linking against a separate assembly file.
If you write your assembly to look like C code GDB is more than happy to provide you with much of that to the extent that it can. In particular, it will identify functions and source mappings from debug symbols.
ffmpeg might have amazingly efficient inner loops (i.e. low-level decoding/encoding), but the broader architecture (e.g. memory buffer implementations) is quite inefficient. As with the low-level media code, it's not that each component is itself inefficient; it's that the interfaces and control-flow semantics between them obstruct both compiler and architectural optimizations.
When I wrote a transcoding multimedia server I ended up writing my own framework and simply pulling in the low-level decoders/encoders, most of which are maintained as separate libraries. I ended up being able to push at least an order of magnitude more streams through the server than if I had used ffmpeg (more specifically, libavcodec) itself, even though I still effectively ended up with an abstraction layer intermediating encoder and format types. And I never wrote a single line of assembly.
There's no secret sauce to optimization: it's not about using assembly, fancier data structures, etc.; it's learning to identify impedance mismatches, and those exist up and down the stack. Sometimes a "dumber" data structure or algorithm can create opportunities (for the developer, for the compiler) for more harmonious data and code flow. And impedance mismatches sometimes exist beyond the code, e.g. a mismatch between functionality and technical capabilities, where your best option might be to redefine the problem, which can often be done without significantly changing how users experience the end product.
> most of which are maintained as separate libraries
This is so confusing I can’t tell if you’re actually talking about libavcodec. The whole point is to combine codecs to share common code; “most” decoders certainly aren’t available elsewhere.
If you just want to call libx264 directly go ahead and do that of course. libx264 uses assembly just as much or more than libavcodec though.
I have a lot of sympathy for wanting efficient code. But let's indeed have a look:
https://github.com/FFmpeg/FFmpeg/blob/7bbad32d5ab69cb52bc92a...
There are so many macros, %if and clutter here that it's difficult (for me?) to keep the big picture in mind.
This reminds me of a retrospective of an OS/window manager written in assembly - they were great about avoiding tiny overheads, but expressed regret that the whole system ended up slow because it was hard to reason about bigger things such as how often to redraw everything, similar to what people are saying here.
To be clear: let's indeed optimize and vectorize, but better to build on intrinsics than go all the way down to assembly.
There are too many different assemblers and syntaxes: inline, MASM, NASM, FASM, YASM. Each comes with its own quirks, and they complicate the build.
Intrinsics are more portable. It's trivial to re-compile legacy SSE intrinsics into AVX1. You won't automatically get 32-byte vectors this way, but you will get VEX encoding, broadcasts for _mm_set1_something, and more.
Readability depends on the code style. When you write intrinsics using "assembly with types" style, actual assembly is indeed more readable. OTOH, with C++ it's possible to make intrinsics way better than assembly: arithmetic operators instead of vaddpd/vsubpd/vmulpd/vdivpd, strongly-typed classes wrapping low-level vectors for specific use cases, etc.
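To make the "arithmetic operators instead of vaddpd/vsubpd/vmulpd/vdivpd" point concrete, here's a minimal sketch of such a wrapper. `Raw`, the `raw_*` helpers, `Vec2d` and `axpy` are names I made up for illustration; the only real APIs are the SSE2 intrinsics from `<emmintrin.h>`. The scalar `#else` branch is only there so the sketch still builds on non-x86 machines.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
using Raw = __m128d;
inline Raw raw_set1(double s) { return _mm_set1_pd(s); }
inline Raw raw_load(const double* p) { return _mm_loadu_pd(p); }
inline void raw_store(double* p, Raw v) { _mm_storeu_pd(p, v); }
inline Raw raw_add(Raw a, Raw b) { return _mm_add_pd(a, b); }
inline Raw raw_sub(Raw a, Raw b) { return _mm_sub_pd(a, b); }
inline Raw raw_mul(Raw a, Raw b) { return _mm_mul_pd(a, b); }
inline Raw raw_div(Raw a, Raw b) { return _mm_div_pd(a, b); }
#else
// Scalar fallback (assumption for portability only; not the point of the sketch).
struct Raw { double lo, hi; };
inline Raw raw_set1(double s) { return {s, s}; }
inline Raw raw_load(const double* p) { return {p[0], p[1]}; }
inline void raw_store(double* p, Raw v) { p[0] = v.lo; p[1] = v.hi; }
inline Raw raw_add(Raw a, Raw b) { return {a.lo + b.lo, a.hi + b.hi}; }
inline Raw raw_sub(Raw a, Raw b) { return {a.lo - b.lo, a.hi - b.hi}; }
inline Raw raw_mul(Raw a, Raw b) { return {a.lo * b.lo, a.hi * b.hi}; }
inline Raw raw_div(Raw a, Raw b) { return {a.lo / b.lo, a.hi / b.hi}; }
#endif

// Hypothetical strongly-typed wrapper around two packed doubles.
struct Vec2d {
    Raw v;
    explicit Vec2d(Raw r) : v(r) {}
    explicit Vec2d(double s) : v(raw_set1(s)) {}  // broadcast a scalar
    static Vec2d load(const double* p) { return Vec2d(raw_load(p)); }
    void store(double* p) const { raw_store(p, v); }
};

inline Vec2d operator+(Vec2d a, Vec2d b) { return Vec2d(raw_add(a.v, b.v)); }
inline Vec2d operator-(Vec2d a, Vec2d b) { return Vec2d(raw_sub(a.v, b.v)); }
inline Vec2d operator*(Vec2d a, Vec2d b) { return Vec2d(raw_mul(a.v, b.v)); }
inline Vec2d operator/(Vec2d a, Vec2d b) { return Vec2d(raw_div(a.v, b.v)); }

// y[i] = a * x[i] + y[i], two lanes at a time with a scalar tail.
void axpy(double a, const double* x, double* y, std::size_t n) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    const Vec2d va(a);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        const Vec2d r = va * Vec2d::load(x + i) + Vec2d::load(y + i);
        r.store(y + i);
    }
    for (; i < n; ++i) y[i] = a * x[i] + y[i];  // odd leftover element
}
```

The loop body reads like ordinary arithmetic, and the loop, tail handling and register save/restore stay the compiler's job. As far as I know, building the same file with `-mavx` on GCC or Clang is enough to get the VEX-encoded forms of these instructions without touching the source.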
Update: most real-life functions contain scalar code (like loops) as well as auto-generated code (stack frame setup, backup and restore of non-volatile registers). When coding non-inline assembly, the developer needs to do all of that manually, which can be hard to get right and may cause bugs like these: https://github.com/openssl/openssl/issues/12328 https://news.ycombinator.com/item?id=33705209
FFmpeg code is god-awful. A lot of it is like from 2002 and written without regard for any sort of "sanity". People who write assembly routines these days have a structure to their code, and if they overrun buffers or whatever they'll document what alignment assumptions they're making. FFmpeg will just start patching its own code at runtime because someone thought it was a good idea on Pentium processors.
https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...
https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e2...