In some cases I suspect it depends a lot on the hardware. I did a bit of research into the "Integer minimum" algorithm. Given the following straightforward implementation:
int min(int x, int y) { return y < x ? y : x; }
..most compilers will produce something very much like the following for x86-64 (from Compiler Explorer https://gcc.godbolt.org using -Ofast -march=haswell):
min(int, int):
cmp esi, edi
mov eax, edi
cmovle eax, esi
ret
Compare that to the generated instructions for the algorithm listed in the "Integer minimum or maximum" section (x+(((y-x)>>(WORDBITS-1))&(y-x))):
min(int, int):
sub esi, edi
mov eax, esi
sar eax, 31
and eax, esi
add eax, edi
ret
It's not immediately obvious to me which one is likely to be faster (I'm not an expert on x86-64 assembly).
Probably not. These low level tricks are supposed to be hidden away inside functions, and any good compiler ought to produce similar machine code in the majority of these examples.
> Do these actually run faster than the code produced by a good compiler? Benchmarks?
Some of these are absolutely suspect. Now, these are all architecture specific: my results below are for gcc, targeting the Intel chip in the MBP in my lap. (Of course, some of the article's code samples assume certain things about an architecture, and are not portable.)
For example,
> Absolute Value of a Float
> IEEE floating point uses an explicit sign bit, so the absolute value can be taken by a bitwise AND with the complement of the sign bit. For IA32 32-bit, the sign bit is an int value of 0x80000000, for IA32 64-bit, the sign bit is the long long int value 0x8000000000000000. Of course, if you prefer to use just int values, the IA32 64-bit sign bit is 0x80000000 at an int address offset of +1. For example:
double x;
/* make x = abs(x) */
*(((int *) &x) + 1) &= 0x7fffffff;
First, this is undefined behavior. The portable way to do this is fabs(), which is part of the standard library. For me, calling fabs() compiles to a single instruction:
fabs() compiled down to essentially the same bitwise &, in assembly, that the article is doing. The article's "advice" also gets compiled to exactly the same instruction & set of constants, which I actually find way more interesting.
> Integer Constant Multiply
> Given an integer value x and an integer or floating point value y, the value of xy can be computed efficiently using a sequence derived from the binary value of x. For example, if x is 5 (4 + 1):*
y2 = y + y;
y4 = y2 + y2;
result = y + y4;
> In the special case that y is an integer, this can be done with shifts:
y4 = (y << 2);
result = y + y4;
Compilers will do this for you, today, automatically. In the example case of multiplying by 5, gcc will do better than the article's code, by using lea cleverly:
leal (%rdi,%rdi,4), %eax
gcc also compiles the article's shift-then-add strategy to exactly the same leal.
The article mentions this optimization again, later,
> Shift-and-Add Optimization
> Rather obviously, if an integer multiply can be implemented by a shift-and-add sequence, then a shift-and-add sequence can be implemented by multiplying by the appropriate constant... with some speedup on processors like the AMD Athlon. Unfortunately, GCC seems to believe constant multiplies should always be converted into shift-and-add sequences, so there is a problem in using this optimization in C source code.
The above leal instruction was the output of GCC, so this is obviously wrong. One can maybe say the article is old, but I'm pretty sure GCC has had the "use leal to do constant multiplies" optimization for over a decade now, but then, perhaps the article is that old. If so, the advice would appear to no longer apply.
Write it the obvious, readable way first. If you decide to optimize at the level this article is discussing, you should be reading the assembly to see if your optimizations will have any effect.
But not all of them are bad. The "is this unsigned integer a power of 2" code is correct, I believe, and if you understand binary, not that hard to understand.
That's not to say these optimisations are no longer needed; you just need to be smarter about applying them. If you're multiplying by 2^n - 1, for example, keeping n around and writing a pair of shifts is not an optimisation a compiler will be able to make, and even when doing the power inline (x * ((1 << n) - 1)) most compilers aren't able to optimise it properly.