Hacker News new | ask | show | jobs
by ridiculous_fish 1590 days ago
The "SWAR" approach `(a & b) + (a ^ b) / 2` looks bizarre but can be understood.

Adding two bits produces a sum and a carry:

    0 + 0 = 0, carry 0
    1 + 0 = 1, carry 0
    0 + 1 = 1, carry 0
    1 + 1 = 0, carry 1
So the sum is XOR, and the carry is bitwise AND. We can rewrite x + y as (x ^ y) + (x & y)*2

Distribute the divide, and you get (x ^ y)/2 + (x & y) which is the mystery expression.

(Note this distribution is safe only because (x & y)*2 is even.)

5 comments

Alternatively, a direct explanation:

a & b is the bits they have in common. If they both have a bit x, you should keep it, because the average of x and x is x.

a ^ b is the bits that only one of them have. You should halve these bits, because the average of x and 0 is x/2.

This explanation makes a ton more sense!
But it has lots of assumptions/prerequisites about taking the mean baked-in. It happens to work for this particular case, but in general you don't get far with such hand-wavy reasoning since you just get confused with which assumptions hold and which don't.

The original explanation was the actual SIMD approach, which is really cool. You can extend it right away to other problems.

The explanation seems to work even in base 10.

Let's do the average of 85 and 95. The leftmost digit of both is equal, so we leave it: X5 The "xor"/2 is the sum of the non equal digits divided by 2 (respecting their powers): (8+9)10 = (17)10 = 85 Now, we add them: 5 + 85 = 90 (which is the average of 85 and 95).

Let's take now 87 and 89: The rightmost digits are equal: 8X We do the 'xor'/2 for the left most: (7+9)/2 = 8 8X + 8 = 88

Let me know if there are some examples for which it would fail (I only spent a minute to test if the explanation extends to other bases).

The average of 15 and 30 is not 22. The average of 51 and 03 is not 22.
It's a half-adder in software!
Thanks for this explanation.

What sets this post apart from the rest is that it actually provides useful information for everybody, after all those other posts of type "but that's obvious, I looked at it for 1 minute, saw half the solution, was too lazy for the rest and in hindsight it's all obvious. What a ridiculous patent system."

Not to say that the patent system is problematic...

Sorry i'm confused and i asked elsewhere.

How is the above SWAR? It looks like a regular set of instructions.

If you, for example, want to do addition of four 8-bit integers within a 32-bit register, you have to use similar techniques to stop the carry from propagating.

For example, when x and y are 32-bit integers holding 4 8-bit integers, you can do

    z = (x ^ y) + (x & y) & 0x7f7f7f7f;
Now z holds four 8-bit integers which hold the sum (modulo 256) of the integers of x and y. The bit mask is to stop the carry from propagating.
This doesn't work because you're not left-shifting (doubling) the carry. But when adding the shifted carry to (x ^ y) we're back to potentially overflowing the highest bits. The solution is to add the highest and the lower bits separately:

  lower = 0x7f7f7f7f;
  highest = ~lower;
  z = ((x & lower) + (y & lower)) ^ ((x ^ y) & highest);
Note this only improves performance for larger container integers.
You're completely right! Typically you don't care about overflow though (and you should use unsigned ints to avoid undefined behavior).
That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

Presumably it would take 4 logical instructions to do the vectored math, vs. 4 logical instructions to do the scalar additions.

I suppose you're looking at a minimum of two registers for the vectored approach, vs. 8 for the scalar approach. Having the result available in separate registers makes them immediately available for use, though.

There's also the overhead of getting the numbers in and out of memory. Loading and storing one word is obviously going to be way better than loading 4 bytes individually.

It seems to me like the vectored approach would be better for algorithms that require iterating through a large dataset in memory. The scalar approach would be better for algorithms that have a bunch of dependent calculations. Perhaps that's an obvious conclusion!

That's pretty neat though. For the large dataset scenario, perhaps you could get a significant speedup on relatively simple architectures such as cortex-m microcontrollers. I suspect that sufficiently modern high end CPUs/compilers wouldn't benefit so much from it, though? All the pipelining, superscaling and caching and whatnot could sufficiently mask the latencies of the memory accesses to the point of being irrelevant. Also, a sufficiently clever compiler could implement the loop with actual SIMD instructions and achieve significantly higher performance than the manual in-register optimization.

This would be a fun way to compute a basic 8-bit checksum on a binary blob in a microcontroller... Not that it would be practically useful because any non-trivially sized blob would be better served with at least a Fletcher checksum if not a full CRC, both of which seemingly lack the necessary associativity.

> That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

The operation itself is not necessarily faster (it could be faster due to pipelining, I think, and it would probably ne faster when storing 8 8-bit ints in a 64-bit int), but it can save the hassle and runtime of packing and unpacking into four variables.

This technique was useful on a 68000 in a 4 voice software PCM sampled instrument music player running on interrupts on the Atari ST back in the day.
Found the game where I worked. Default hardware bleeps and bloops in first version https://www.youtube.com/watch?v=RGOdHT29Jpc

Four voice digi player using SWAR in second version https://www.youtube.com/watch?v=1GrdvcghDXE

Very cool! The difference in audio fidelity is quite impressive.

Reminds me of a weekend hobby project I did back in 2014 or so. I had an itch to play with analog video signal generation from an atmega328p. Rather than use one of the existing libraries, though, I started from scratch with the goal of achieving the highest possible resolution. I used the SPI peripheral to clock out 8 pixels at a time at 8Mhz without any gaps, giving me something like 12 instructions to prepare the next byte. There wasn't enough RAM for a frame buffer at the resolution, so I instead used character tiles; that ate up the whole budget. I forget what the resolution was, but it was significantly higher than the existing library was capable of. There was a jitter, which I tracked down the the variability in interrupt latency due to the AVR having variable cycle length instructions. I was using a timer interrupt to schedule the start of and complete transmission of each scanline, so that the main program could focus purely on application logic. I wrote an inline assembly routine at the start of the interrupt handler to insert a variable number of noop instructions depending on the relative phase of the hardware timer, and the output became rock solid.

That of course reminds me of a project in 2007 where I needed to go the other direction, and decode an analog video signal on an 8 bit PIC microcontroller. The signal was from a camera on an actuator, meant to detect the relative position of the sun for the purpose of aiming a parabolic solar concentrator. I was able to filter out all visible light with some overdeveloped film negative so that the video signal was simply a white dot on a black background, and then wire it up through some voltage dividers to the PIC's two voltage comparators. One comparator detected sync pulses, and the other one detected black to white transitions. The firmware would simply track the timing of sync pulses to know the current scanline and position within the current scanline. Good times!

Wikipedia says: "It also refers to the use of SIMD with general-purpose registers and instructions that were not meant to do it at the time, by way of various novel software tricks."

https://en.wikipedia.org/wiki/SWAR

Maybe that's the way it's meant?

Compilers might be smart enough to pick up the idiom used in the example and compile them to something done in parallel?

SIMD is "single instruction, multiple data". In this case, the data are single bits. It's basically operations on bit-vectors of width 32 (or whatever the machine word width is). All the bits are treated independently. That's exactly how you'd do SIMD elsewhere.
Why not just

   (a >> 1) + (b >> 1) + (a & 1) & (b & 1)
That's the "patented solution" referred to in the article.
Which is shocking to me, since this is the solution which jumps immediately to mind (and usually patents should be above that line). I could understand if the next solution was patented, but not this one.
Don't read the article, and you would never know it's patented.
"just"? That's 2 additions, 2 shifts, and 2-3 bitwise operations, while the other method is 1 addition, 1 shift, and 2 bitwise operations.
I would think optimizing compilers can optimize this expression to the one in the article, but I tried it on godbolt and they don't.