by dansalvato 526 days ago

The interesting thing about testing values (like testing whether a number is even) is that at the assembly level, the CPU sets flags when the arithmetic happens, rather than needing a separate "compare" instruction.

gcc likes to use `and edi,1` (logical AND between 32-bit edi register and 1). Meanwhile, clang uses `test dil,1` which is similar, except the result isn't stored back in the register, which isn't relevant in my test case (it could be relevant if you want to return an integer value based on the results of the test).

After the logical AND happens, the CPU's ZF (zero) flag is set if the result is zero, and cleared if the result is not zero. You'd then use `jne` (jump if not equal) or maybe `cmovne` (conditional move - move register if not equal). Note again that there is no explicit comparison instruction. If you don't use O3, the compiler does produce an explicit `cmp` instruction, but it's redundant.
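To make the `and` versus `test` distinction concrete, here is a toy model in C. It is only a sketch: `cpu_state`, `model_and`, and `model_test` are hypothetical names invented for illustration, and the model tracks nothing but one register and ZF.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not real hardware: one register plus the zero flag. */
typedef struct {
    uint32_t reg;  /* register contents after the instruction */
    bool zf;       /* zero flag: set when the AND result is zero */
} cpu_state;

/* Models `and reg, imm`: the result is written back and ZF reflects it. */
static cpu_state model_and(uint32_t reg, uint32_t imm) {
    cpu_state s;
    s.reg = reg & imm;
    s.zf = (s.reg == 0);
    return s;
}

/* Models `test reg, imm`: same flag result, register left unchanged. */
static cpu_state model_test(uint32_t reg, uint32_t imm) {
    cpu_state s;
    s.reg = reg;
    s.zf = ((reg & imm) == 0);
    return s;
}
```

In both cases a following `jne` would branch only when `zf` is false, which for an operand of 1 means the value was odd.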

Now, the question is: Which is more efficient, gcc's `and edi,1` or clang's `test dil,1`? The `dil` register was added for x64; it's the same register as `edi` but only the lower 8 bits. I figured `dil` would be more efficient for this reason, because the `1` operand is implied to be 8 bits and not 32 bits. However, `and edi,1` encodes to 3 bytes while `test dil,1` encodes to 4 bytes. I guess the `and` instruction lets you specify the bit size of the operand regardless of the register size.

There is one more option, which neither compiler used: `shr edi,1` will perform a right shift on EDI, which sets the CF (carry) flag if a 1 is shifted out. That instruction only encodes to 2 bytes, so size-wise it's the most efficient.
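For reference, these are the encodings I believe the three instructions assemble to, written as C byte arrays (the array names are made up for this sketch). As far as I can tell, the extra byte in `test dil,1` is the REX prefix needed to address `dil`, while `and` has a short form that takes a sign-extended 8-bit immediate.

```c
#include <stddef.h>
#include <stdint.h>

/* x86-64 machine code, register-direct forms. Array names are illustrative. */

/* and edi,1: opcode 83 /4 with a sign-extended imm8 */
static const uint8_t enc_and_edi_1[] = { 0x83, 0xE7, 0x01 };

/* test dil,1: REX prefix (to reach dil), opcode F6 /0, imm8 */
static const uint8_t enc_test_dil_1[] = { 0x40, 0xF6, 0xC7, 0x01 };

/* shr edi,1: opcode D1 /5, the shift count of 1 is implicit */
static const uint8_t enc_shr_edi_1[] = { 0xD1, 0xEF };
```

Taking `sizeof` of each array gives the 3, 4, and 2 byte sizes quoted above.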

The right-shift option fascinates me, because I don't think there's really a C representation of "get the bit that was right-shifted out". Both gcc and clang compile `(i >> 1) << 1 == i` the same as `(i & 1) == 0` and `i % 2 == 0`.
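Written out as functions, the three spellings look like this. This is a minimal sketch assuming an `unsigned` argument; the function names are mine. The parentheses around the AND matter, because `==` binds tighter than `&` in C.

```c
#include <stdbool.h>

/* Shift the low bit out and back in; unchanged only if bit 0 was 0. */
static bool even_shift(unsigned i) { return ((i >> 1) << 1) == i; }

/* Mask bit 0 directly. */
static bool even_and(unsigned i) { return (i & 1u) == 0; }

/* Remainder after dividing by 2. */
static bool even_mod(unsigned i) { return i % 2u == 0; }
```

All three return the same result for every unsigned input, which is why an optimizing compiler is free to pick one instruction sequence for all of them.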

Which of the above is most efficient on CPU cycles? Who knows, there are too many layers of abstraction nowadays to have a definitive answer without benchmarking for a specific use case.

I code a lot of Motorola 68000 assembly. On m68k, shifting right by 1 and performing a logical AND both take 8 CPU cycles. But the right-shift is 2 bytes smaller, because it doesn't need an extra 16 bits for the operand. That makes a difference on Amiga, because (other than size) the DMA might be shared with other chips, so you're saving yourself a memory read that could stall the CPU while it's waiting its turn. Therefore, at least on m68k, shifting right is the fastest way to test if a value is even.

2 comments

userbinator 526 days ago

> That instruction only encodes to 2 bytes, so size-wise it's the most efficient.

In isolation it's the smallest, but it's no longer the smallest if you consider that the value, which in this example is the loop counter, needs to be preserved, meaning you'll need at least 2 bytes for another mov to make a copy. With test, the value doesn't get modified.
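A rough C model of that point, assuming the tested value is the loop counter. `shr1_carry` and `count_even` are hypothetical helpers: the first mimics `shr reg,1` by shifting in place and returning the bit that fell out (the carry), so the loop has to work on a copy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of `shr reg,1`: shifts in place, returns the bit shifted out (CF). */
static bool shr1_carry(uint32_t *reg) {
    bool carry = (*reg & 1u) != 0;
    *reg >>= 1;
    return carry;
}

/* Counts even values in [0, n) using the shift-based test. */
static uint32_t count_even(uint32_t n) {
    uint32_t evens = 0;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t scratch = i;        /* the extra copy (the `mov`) */
        if (!shr1_carry(&scratch))   /* carry clear means bit 0 was 0, so even */
            evens++;
    }
    return evens;
}
```

Passing `&i` instead of `&scratch` would halve the counter on every iteration, which is the destructive behaviour the extra `mov` avoids.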


dansalvato 526 days ago

That is true, I deliberately set up an isolated scenario to do these fun theoretical tests. It actually took some effort to stop the compiler from being too smart, because it would want to transform the result into a return value, or even into a pointer offset, to avoid branching.


amiga386 526 days ago

> On m68k, shifting right by 1 and performing a logical AND both take 8 CPU cycles. But the right-shift is 2 bytes smaller

There's also BTST #0,xx but it wastefully needs an extra 16 bits to say which bit to test (even though the bit can only be from 0-31)

> That makes a difference on Amiga, because (other than size) the DMA might be shared with other chips, so you're saving yourself a memory read that could stall the CPU while it's waiting its turn.

That's a load-bearing "could". If the 68000 has to read/write chip RAM, it gets the even cycles while the custom chips get odd cycles, so it doesn't even notice (unless you're doing something that steals even cycles from the CPU, e.g. the blitter is active and you set BLTPRI, or you have 5+ bitplanes in lowres or 3+ bitplanes in highres)


dansalvato 526 days ago

> There's also BTST #0,xx but it wastefully needs an extra 16 bits to say which bit to test (even though the bit can only be from 0-31)

That reminds me, it's theoretically fastest to do `and d1,d0` e.g. in a loop if d1 is pre-loaded with the value (4 cycles and 1 read). `btst d1,d0` is 6 cycles and 1 read.

> the blitter is active and you set BLTPRI

I thought BLTPRI enabled meant the blitter takes every even DMA cycle it needs, and when disabled it gives the CPU 1 in every 4 even DMA cycles. But yes, I'm splitting hairs a bit when it comes to DMA performance because I code game/demo stuff targeting stock A500, meaning one of those cases (blitter running or 5+ bitplanes enabled) is very likely to be true.


amiga386 526 days ago

> it's theoretically fastest to do `and d1,d0` e.g. in a loop

That's true, although I'd add that ASR/AND are destructive while BTST would be nondestructive, but we're pretty far down a chain of hypotheticals at this point (why would someone even need to test evenness in a loop, when they could unroll the loop to do 2/4/6/8 items at a time with even/odd behaviour baked in)

> I thought BLTPRI enabled meant the blitter takes every even DMA cycle it needs, and when disabled it gives the CPU 1 in every 4 even DMA cycles

Yes, that is true: https://amigadev.elowar.com/read/ADCD_2.1/Hardware_Manual_gu... "If given the chance, the blitter would steal every available Chip memory cycle [...] If DMAF_BLITHOG is a 1, the blitter will keep the bus for every available Chip memory cycle [...] If DMAF_BLITHOG is a 0, the DMA manager will monitor the 68000 cycle requests. If the 68000 is unsatisfied for three consecutive memory cycles, the blitter will release the bus for one cycle."
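The quoted rule can be sketched as a tiny simulation in C. This is a deliberately simplified model, not the real DMA arbiter: it assumes the blitter and the 68000 both want every cycle, ignores all other DMA channels, and `cpu_cycles_granted` is a name invented for the example.

```c
#include <stdbool.h>

/* Returns how many of `cycles` contested memory cycles go to the CPU.
 * Assumption: blitter and CPU both request the bus on every cycle. */
static unsigned cpu_cycles_granted(unsigned cycles, bool blithog) {
    unsigned granted = 0;
    unsigned unsatisfied = 0;  /* consecutive cycles the CPU has been refused */
    for (unsigned c = 0; c < cycles; c++) {
        if (!blithog && unsatisfied == 3) {
            granted++;         /* blitter releases the bus for one cycle */
            unsatisfied = 0;
        } else {
            unsatisfied++;     /* blitter keeps the bus */
        }
    }
    return granted;
}
```

With `blithog` false the CPU ends up with one cycle in every four, and with it true the CPU gets nothing until the blitter finishes, which matches the description above.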

> one of those cases is very likely to be true

It blew my mind when I realised this is probably why Workbench is 4 colours by default. If it were 8, an unexpanded Amiga would seem a lot slower to application/productivity users.
