|
The interesting thing about testing values (like testing whether a number is even) is that at the assembly level, the CPU sets flags when the arithmetic happens, rather than needing a separate "compare" instruction. gcc likes to use `and edi,1` (logical AND between 32-bit edi register and 1). Meanwhile, clang uses `test dil,1` which is similar, except the result isn't stored back in the register, which isn't relevant in my test case (it could be relevant if you want to return an integer value based on the results of the test). After the logical AND happens, the CPU's ZF (zero) flag is set if the result is zero, and cleared if the result is not zero. You'd then use `jne` (jump if not equal) or maybe `cmovne` (conditional move - move register if not equal). Note again that there is no explicit comparison instruction. If you don't use O3, the compiler does produce an explicit `cmp` instruction, but it's redundant. Now, the question is: Which is more efficient, gcc's `and edi,1` or clang's `test dil,1`? The `dil` register was added for x64; it's the same register as `edi` but only the lower 8 bits. I figured `dil` would be more efficient for this reason, because the `1` operand is implied to be 8 bits and not 32 bits. However, `and edi,1` encodes to 3 bytes while `test dil,1` encodes to 4 bytes. I guess the `and` instruction lets you specify the bit size of the operand regardless of the register size. There is one more option, which neither compiler used: `shr edi,1` will perform a right shift on EDI, which sets the CF (carry) flag if a 1 is shifted out. That instruction only encodes to 2 bytes, so size-wise it's the most efficient. The right-shift option fascinates me, because I don't think there's really a C representation of "get the bit that was right-shifted out". Both gcc and clang compile `(i >> 1) << 1 == i` the same as `i & 1 == 0` and `i % 2 == 0`. Which of the above is most efficient on CPU cycles? Who knows, there are too many layers of abstraction nowadays to have a definitive answer without benchmarking for a specific use case. I code a lot of Motorola 68000 assembly. On m68k, shifting right by 1 and performing a logical AND both take 8 CPU cycles. But the right-shift is 2 bytes smaller, because it doesn't need an extra 16 bits for the operand. That makes a difference on Amiga, because (other than size) the DMA might be shared with other chips, so you're saving yourself a memory read that could stall the CPU while it's waiting its turn. Therefore, at least on m68k, shifting right is the fastest way to test if a value is even. |
In isolation it's the smallest, but it's no longer the smallest if you consider that the value, which in this example is the loop counter, needs to be preserved, meaning you'll need at least 2 bytes for another mov to make a copy. With test, the value doesn't get modified.