Also, there is another way to do this while keeping 64 bit limbs. All variables uint64_t. s0 += a0;
s1 += a1;
s2 += a2;
s3 += a3;
c0 = s0 < a0; // RISC-V `sltu`
c1 = s1 < a1;
c2 = s2 < a2;
if (s1 == -1) goto propagate0; // executes 1 time in 18,446,744,073,709,551,616
check_s2:
if (s2 == -1) goto propagate1; // ditto
add_carries:
s1 += c0;
s2 += c1;
s3 += c2;
goto done;
propagate0: c1 = c0; goto check_s2;
propagate1: c2 = c1; goto add_carries;
done:
The key insight here is that unless the sum at a particular limb position is all 1s the carry out from that position DOES NOT DEPEND on the carry in to that limb position, but only on whether the original add in that position produces a carry. If the sum is all 1s the the carry out is the same as the carry in.If you express this with a conditional branch which is overwhelmingly predicted as not taken then the code should execute each block of instructions entirely in parallel, provided that multiple conditional branches can be predicted as not-taken in the same clock cycle. One time in 2^64 it will execute very slowly. With 4 limb numbers on a 4-wide machine this doesn't offer an advantage over `adc` as there are also 4 code blocks. But on, say, an 8-wide machine with 8 limb numbers you're really starting to gain. It's probably not going to help on current x86_64, but might well do on Apple's M* series, where even the M1 is 8-wide, though it might be tricky to work around the Arm ISA. When the 8-wide RISC-V Ascalon processor from Tenstorrent hits hopefully late this year or early 2026 we will really see. And others such as Ventana, Rivos, XiangShan. This will work even better in a wide SIMD, if you have a fast 1-lane shift (Called slideup on RISC-V). |
If you can substitute a cmov without control flow then it's probably safer, e.g. c1 |= c0 & seq(s1,-1) or so, so long as you can make sure the compiler won't turn it into a branch.
It does add a data dependency though ...