Not in the general case. It's been a long time since x86 assembly developers could commonly beat a decent optimizing compiler.
The thing about compilers is that they're leveraging, even if imperfectly, the collective wisdom of their authors and of the companies who actually built the chips and have offered insight, advice, and sometimes even code. It's very probable they know more performance tricks than you do.
One problem is landmines in the ISA, such as instructions that look like they exist to be used, but are really traps for the unwary programmer who didn't look closely at their performance characteristics, because they're implemented in suboptimal microcode. Or certain sequences of instructions that might combine to do something ridiculously slow[1].
These landmines vary by microarchitecture. An instruction that's incredibly slow on one line of x86 chips might be a wonder-drug on another. This both increases the probability that your code will hit a landmine on at least some CPUs, and gives you a possible "in": Compilers aren't going to optimize perfectly for every microarchitecture. If you know exactly what you're doing (or spend a hell of a lot of time on trial and error), you might be able to come up with optimal codepaths for specific chips that the compiler didn't.
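To make the "optimal codepaths for specific chips" idea concrete, here is a minimal sketch of runtime dispatch in C. Everything named here (sum_generic, sum_tuned, pick_sum, sum_u32) is hypothetical, and sum_tuned is only a stand-in: a real one would hold the chip-specific code. The CPU check uses the GCC/clang builtin __builtin_cpu_supports, guarded so that other targets simply fall back to the generic path. Picking "avx2" as the feature to test is an arbitrary assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Baseline: a plain loop the compiler is free to optimize as it sees fit. */
static uint64_t sum_generic(const uint32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Hypothetical stand-in for a hand-tuned variant. Here it is only a
 * 4-way unrolled loop with independent accumulators; a real version
 * would contain code tuned for one microarchitecture. */
static uint64_t sum_tuned(const uint32_t *v, size_t n)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        a += v[i];
    return a + b + c + d;
}

/* Choose an implementation based on what the running CPU reports. */
static sum_fn pick_sum(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))   /* assumed feature, for illustration */
        return sum_tuned;
#endif
    return sum_generic;
}

/* Public entry point: resolves the implementation on first call.
 * Note: this lazy initialization is not thread-safe as written. */
uint64_t sum_u32(const uint32_t *v, size_t n)
{
    static sum_fn impl;
    if (impl == NULL)
        impl = pick_sum();
    return impl(v, n);
}
```

The point of the shape is that the tuned path stays confined to one small function, and every other chip keeps whatever the compiler generated.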
By and large it's not worth it, though. Hand-tuned assembly still ends up in places, but increasingly rarely, and it's confined to small hot-spots. A particular algorithm or part of an algorithm gets re-implemented in assembly because the compiler just can't get it right.
[1] I could have sworn there was a story about this just recently, but I can't seem to find it. Something like a piece of code running way slower than anyone thought it should, until an AMD engineer piped up and said "Oh yeah, don't do that, it causes a pipeline flush." for reasons that were utterly non-obvious to anyone who didn't know the internals of the chip.
I don't want to be too harsh - this is a fun idea, and I'm prone to silly fantasies about rewriting slow code in assembly myself - but this particular assembly doesn't take advantage of many of the "dirty tricks" that are available in low-level code.
As one example, check out the content-type detection, which is essentially a long chain of repeated strlen + strcmp; assembly language doesn't magically make bad algorithms fast.
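For what it's worth, the algorithmic fix looks the same in any language. A rough sketch in C, assuming the type is chosen from the file extension (I haven't reproduced the server's actual logic, and the table contents and the name content_type_for are made up for illustration): keep a sorted table and binary-search it, instead of comparing against every candidate in turn.

```c
#include <stdlib.h>
#include <string.h>

struct mime_entry {
    const char *ext;   /* file extension, without the dot */
    const char *type;  /* Content-Type value to send */
};

/* Illustrative entries only. Must stay sorted by ext (strcmp order)
 * so that bsearch() works. */
static const struct mime_entry mime_table[] = {
    { "css",  "text/css" },
    { "gif",  "image/gif" },
    { "html", "text/html" },
    { "jpg",  "image/jpeg" },
    { "js",   "application/javascript" },
    { "png",  "image/png" },
    { "txt",  "text/plain" },
};

/* bsearch() comparator: key is the extension string, elem a table entry. */
static int cmp_ext(const void *key, const void *elem)
{
    const struct mime_entry *e = elem;
    return strcmp((const char *)key, e->ext);
}

/* Hypothetical helper: map a request path to a Content-Type string. */
const char *content_type_for(const char *path)
{
    const char *dot = strrchr(path, '.');
    if (dot == NULL || dot[1] == '\0')
        return "application/octet-stream";

    const struct mime_entry *hit =
        bsearch(dot + 1, mime_table,
                sizeof mime_table / sizeof mime_table[0],
                sizeof mime_table[0], cmp_ext);
    return hit ? hit->type : "application/octet-stream";
}
```

With a handful of extensions the speedup is small either way; the table mostly buys you one place to add a type instead of another copy of the same block.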
Not to mention that the long chain is ugly to read. I would rather see a macro defined and called multiple times than see the same block of code copy/pasted over and over.
Maybe, maybe not, but for a web server, if asm vs. C makes a noticeable difference to overall performance, one of the two implementations is doing something very wrong - a web server should spend most of its time in syscalls shuffling data to/from the network, not executing userland code.
> Is handwritten assembly faster than GCC/clang-written assembly?
Sometimes, but the biggest case is if you can carefully arrange a tight inner loop, especially one that can make use of SIMD, like some DSP and scientific-computing code. Auto-vectorizers are getting better, but still miss lots of cases, so a skilled asm programmer can beat the compiler. The more "spread out" the performance-critical code is, in general (i.e. performance not dominated by one or two tight loops), the harder it is for hand-coded asm to beat a compiler; humans are not that good at doing whole-program optimization on large codebases. The more cross-platform the code has to be, the worse for the asm programmer as well: beating gcc's code-gen on one architecture is easier than beating it everywhere.
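As a rough illustration of the two kinds of loop, here is a sketch in plain C (the function names are mine, not from any particular codebase). The first is the textbook shape auto-vectorizers are built for, with restrict promising the arrays don't overlap. The second has a data-dependent early exit, which is the kind of loop vectorizers have tended to handle less reliably. Whether either one actually gets vectorized depends on the compiler, version, and flags, so check the generated assembly or a vectorization report (e.g. GCC's -fopt-info-vec or clang's -Rpass=loop-vectorize) rather than taking my word for it.

```c
#include <stddef.h>

/* y[i] += a * x[i]. The restrict qualifiers tell the compiler that x and y
 * never alias, which removes one common obstacle to vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Returns the index of the first element greater than threshold, or n if
 * there is none. The early return makes the trip count data-dependent. */
size_t first_above(size_t n, const float *x, float threshold)
{
    for (size_t i = 0; i < n; i++)
        if (x[i] > threshold)
            return i;
    return n;
}
```

If the second loop turns out to be the hot spot and the compiler won't vectorize it, that's the sort of small, contained case where hand-written SIMD can still pay off.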