by brucehoult 392 days ago
Also, there is another way to do this while keeping 64 bit limbs. All variables uint64_t.

    s0 += a0;
    s1 += a1;
    s2 += a2;
    s3 += a3;
    
    c0 = s0 < a0; // RISC-V `sltu`
    c1 = s1 < a1;
    c2 = s2 < a2;
    
    if (s1 == -1) goto propagate0; // executes 1 time in 18,446,744,073,709,551,616
    check_s2:
    if (s2 == -1) goto propagate1; // ditto
    
    add_carries:
    s1 += c0;
    s2 += c1;
    s3 += c2;
    goto done;
    
    propagate0: c1 = c0; goto check_s2;
    
    propagate1: c2 = c1; goto add_carries;
    
    done:
The key insight here is that, unless the sum at a particular limb position is all 1s, the carry out from that position DOES NOT DEPEND on the carry in to that limb position, but only on whether the original add in that position produced a carry. If the sum is all 1s, then the carry out is the same as the carry in.

If you express this with a conditional branch which is overwhelmingly predicted as not taken then the code should execute each block of instructions entirely in parallel, provided that multiple conditional branches can be predicted as not-taken in the same clock cycle.

One time in 2^64 it will execute very slowly.
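
For anyone who wants to check the logic, here is a minimal sketch of the same idea as a self-contained C function. The function name `add4_speculative` and the array layout (least significant limb first) are assumptions for illustration and are not from the snippet above. It uses plain `if` statements rather than the `goto` layout, so a compiler is free to turn them into conditional moves; treat it as a correctness sketch rather than a performance claim.

```c
#include <stdint.h>

/* Hypothetical helper: s += a for 4-limb numbers, limb 0 least significant.
   The result is taken modulo 2^256 (the carry out of limb 3 is dropped,
   as in the original snippet). */
static void add4_speculative(uint64_t s[4], const uint64_t a[4])
{
    /* Four independent limb additions. */
    s[0] += a[0];
    s[1] += a[1];
    s[2] += a[2];
    s[3] += a[3];

    /* Carry generated by each limb's own addition (RISC-V `sltu`). */
    uint64_t c0 = s[0] < a[0];
    uint64_t c1 = s[1] < a[1];
    uint64_t c2 = s[2] < a[2];

    /* Rare path: an all-ones sum passes its carry-in straight through.
       An all-ones sum cannot itself have generated a carry, so
       overwriting c1/c2 loses nothing. */
    if (s[1] == UINT64_MAX) c1 = c0;
    if (s[2] == UINT64_MAX) c2 = c1;

    /* Apply the carries; none of these additions can overflow unless
       the limb was all ones, which was handled above. */
    s[1] += c0;
    s[2] += c1;
    s[3] += c2;
}
```

Adding 1 to a value whose low three limbs are all ones exercises both rare paths: the carry out of limb 0 has to ripple through limbs 1 and 2 into limb 3.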

With 4 limb numbers on a 4-wide machine this doesn't offer an advantage over `adc` as there are also 4 code blocks. But on, say, an 8-wide machine with 8 limb numbers you're really starting to gain.

It's probably not going to help on current x86_64, but might well do on Apple's M* series, where even the M1 is 8-wide, though it might be tricky to work around the Arm ISA.

When the 8-wide RISC-V Ascalon processor from Tenstorrent arrives, hopefully late this year or early 2026, we will really see. The same goes for others such as Ventana, Rivos, and XiangShan.

This will work even better in wide SIMD, if you have a fast 1-lane shift (called slideup on RISC-V).

2 comments

Neat, but if you're using this in cryptographic code (one of the main consumers of bignums), keep in mind that secret data reaching branches is usually a side-channel risk. Sure, it's only 1 time in 2^64 on random data, but if you're depending on that, then you have to consider whether an attacker can choose data that will make it happen more often.

If you can substitute a cmov without control flow then it's probably safer, e.g. c1 |= c0 & seq(s1,-1) or so, so long as you can make sure the compiler won't turn it into a branch.

It does add a data dependency though ...
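
A rough sketch of that branch-free variant in C, reading `seq(s1,-1)` as "1 if s1 is all ones, else 0". The function name `add4_branchless` and the array layout are assumptions for illustration. Nothing in the C standard stops a compiler from emitting a branch here, so this is not a constant-time guarantee; the generated code would have to be inspected.

```c
#include <stdint.h>

/* Hypothetical helper: s += a for 4-limb numbers, limb 0 least significant,
   with carry propagation written without control flow in the source. */
static void add4_branchless(uint64_t s[4], const uint64_t a[4])
{
    s[0] += a[0];
    s[1] += a[1];
    s[2] += a[2];
    s[3] += a[3];

    uint64_t c0 = s[0] < a[0];
    uint64_t c1 = s[1] < a[1];
    uint64_t c2 = s[2] < a[2];

    /* An all-ones sum never generates a carry of its own, so OR-ing in the
       carry-in has the same effect as the assignment c1 = c0. */
    c1 |= c0 & (uint64_t)(s[1] == UINT64_MAX);
    /* This line depends on the updated c1: the serial chain c0 -> c1 -> c2. */
    c2 |= c1 & (uint64_t)(s[2] == UINT64_MAX);

    s[1] += c0;
    s[2] += c1;
    s[3] += c2;
}
```

The second `|=` reads the result of the first, which is the data dependency mentioned above.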

Yes, for cryptography you'd like to have constant time, but this has to be an awfully low bandwidth channel!

A `cmov` will have the same serialisation problem as `adc` but on machines without carry it might still leave you better off than the obvious `add s,a,b; sltu co,s,a; add s,s,ci; sltu t,s,ci; or co,co,t`.
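
For reference, the five-instruction sequence quoted above maps onto C roughly as follows. The function name `add_with_carry` and its signature are assumptions for illustration; each line is annotated with the instruction it corresponds to.

```c
#include <stdint.h>

/* Hypothetical helper: one limb of add-with-carry on a machine without a
   carry flag. ci is the carry in (0 or 1); the carry out is stored in *co. */
static uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t ci, uint64_t *co)
{
    uint64_t s = a + b;   /* add  s,a,b   */
    uint64_t c = s < a;   /* sltu co,s,a  : carry from a + b */
    s += ci;              /* add  s,s,ci  */
    uint64_t t = s < ci;  /* sltu t,s,ci  : carry from adding the carry in */
    *co = c | t;          /* or   co,co,t : at most one of c, t can be set */
    return s;
}
```

Chaining this limb by limb makes every limb wait for the previous limb's `co`, which is the serialisation under discussion.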

I think you want to write:

  if (s1 == -1)
     c1 = c0;
  if (s2 == -1)
     c2 = c1;

These can become conditional moves on x86. I've often thought RISC-V should have implemented an IF instruction instead of compare and branch. IF would cause the next instruction to be executed conditionally while not needing a flag register at the ISA level. They could have required only branch and jump to be conditional, but it turns out conditional mov, load, and store are all very useful in real code.

The problem is that, as far as I know, a conditional move is going to introduce a data dependency from c0 to c1 to c2 that is the exact thing we are trying to get rid of. The cmov is a constant time instruction, not a speculated instruction like a conditional branch.

The entire point of what I did is that the two conditional branches will be predicted not taken, so the CPU will 99.9999999999999999946% of the time not even see the `c1 = c0` and `c2 = c1` instructions that introduce the sequential dependencies.

That sounds like it would be quite a pain to implement and program. E.g. what happens if there's an interrupt between the IF and the following instruction? You need to add a CSR to read/write the conditional state, similar to the vector control CSRs (vstart etc.). Hard to see how that extra complexity would be worth it.

Modern branch predictors are very good and most branches are very predictable.