Hacker News new | ask | show | jobs
by rustybolt 1591 days ago
If you, for example, want to do addition of four 8-bit integers within a 32-bit register, you have to use similar techniques to stop the carry from propagating.

For example, when x and y are 32-bit integers holding 4 8-bit integers, you can do

    z = (x ^ y) + (x & y) & 0x7f7f7f7f;
Now z holds four 8-bit integers which hold the sum (modulo 256) of the integers of x and y. The bit mask is to stop the carry from propagating.
2 comments

This doesn't work because you're not left-shifting (doubling) the carry. But when adding the shifted carry to (x ^ y) we're back to potentially overflowing the highest bits. The solution is to add the highest and the lower bits separately:

  lower = 0x7f7f7f7f;
  highest = ~lower;
  z = ((x & lower) + (y & lower)) ^ ((x ^ y) & highest);
Note this only improves performance for larger container integers.
You're completely right! Typically you don't care about overflow though (and you should use unsigned ints to avoid undefined behavior).
That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

Presumably it would take 4 logical instructions to do the vectored math, vs. 4 logical instructions to do the scalar additions.

I suppose you're looking at a minimum of two registers for the vectored approach, vs. 8 for the scalar approach. Having the result available in separate registers makes them immediately available for use, though.

There's also the overhead of getting the numbers in and out of memory. Loading and storing one word is obviously going to be way better than loading 4 bytes individually.

It seems to me like the vectored approach would be better for algorithms that require iterating through a large dataset in memory. The scalar approach would be better for algorithms that have a bunch of dependent calculations. Perhaps that's an obvious conclusion!

That's pretty neat though. For the large dataset scenario, perhaps you could get a significant speedup on relatively simple architectures such as cortex-m microcontrollers. I suspect that sufficiently modern high end CPUs/compilers wouldn't benefit so much from it, though? All the pipelining, superscaling and caching and whatnot could sufficiently mask the latencies of the memory accesses to the point of being irrelevant. Also, a sufficiently clever compiler could implement the loop with actual SIMD instructions and achieve significantly higher performance than the manual in-register optimization.

This would be a fun way to compute a basic 8-bit checksum on a binary blob in a microcontroller... Not that it would be practically useful because any non-trivially sized blob would be better served with at least a Fletcher checksum if not a full CRC, both of which seemingly lack the necessary associativity.

> That's pretty neat. Is it actually any faster than just doing four 8-bit adds, though?

The operation itself is not necessarily faster (it could be faster due to pipelining, I think, and it would probably ne faster when storing 8 8-bit ints in a 64-bit int), but it can save the hassle and runtime of packing and unpacking into four variables.

This technique was useful on a 68000 in a 4 voice software PCM sampled instrument music player running on interrupts on the Atari ST back in the day.
Found the game where I worked. Default hardware bleeps and bloops in first version https://www.youtube.com/watch?v=RGOdHT29Jpc

Four voice digi player using SWAR in second version https://www.youtube.com/watch?v=1GrdvcghDXE

Very cool! The difference in audio fidelity is quite impressive.

Reminds me of a weekend hobby project I did back in 2014 or so. I had an itch to play with analog video signal generation from an atmega328p. Rather than use one of the existing libraries, though, I started from scratch with the goal of achieving the highest possible resolution. I used the SPI peripheral to clock out 8 pixels at a time at 8Mhz without any gaps, giving me something like 12 instructions to prepare the next byte. There wasn't enough RAM for a frame buffer at the resolution, so I instead used character tiles; that ate up the whole budget. I forget what the resolution was, but it was significantly higher than the existing library was capable of. There was a jitter, which I tracked down the the variability in interrupt latency due to the AVR having variable cycle length instructions. I was using a timer interrupt to schedule the start of and complete transmission of each scanline, so that the main program could focus purely on application logic. I wrote an inline assembly routine at the start of the interrupt handler to insert a variable number of noop instructions depending on the relative phase of the hardware timer, and the output became rock solid.

That of course reminds me of a project in 2007 where I needed to go the other direction, and decode an analog video signal on an 8 bit PIC microcontroller. The signal was from a camera on an actuator, meant to detect the relative position of the sun for the purpose of aiming a parabolic solar concentrator. I was able to filter out all visible light with some overdeveloped film negative so that the video signal was simply a white dot on a black background, and then wire it up through some voltage dividers to the PIC's two voltage comparators. One comparator detected sync pulses, and the other one detected black to white transitions. The firmware would simply track the timing of sync pulses to know the current scanline and position within the current scanline. Good times!

Very nice! Good work.

Some more info on the digi player on the ST. It used a timer interrupt to service the PCM sample but if you used just the interrupt there was significant noise because there was significant variability of the timing on the interrupt. To get the timing tighter the interrupt timing was changed to hit the routine on every video line just prior to the actual hsync and then hsync was polled to get very precise timing.

The PCM was just a linearization by combining three logarithmic volumes of the three PSG voices.

During the title sequence, a special version of the code was running where several 68000 registers were reserved globally for the digi player. So those did not need to be saved / restored in the interrupt routine!

The SWAR was involved when advancing the four wrapping 8 bit indices into the each of the four voice's 256 entry sample tables. This was part of a monumental effort to get the interrupt routine to be as quick as possible.

Your video decoding reminds me of when I worked at a video card company in the nineties we had a competitive advantage by using a commodity part in an unusual way. This video decoder hardware was commonly used to take composite video and decode it. We supported that but we also did a bunch of advanced features by using a seldom used mode where it could be used in a raw mode where it took the composite signal and stored the analog to digital conversion in memory. We had high speed assembly code that could decode the video better than the hardware and supported some cool additional features. Anyway... memories a bit hazy. Been a while but I remember it being very cool.