| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by brucehoult 382 days ago
	Someone working entirely on x86_64 very nicely demonstrates that RISC-V is not wrong to omit the carry flag.

4 comments

brucehoult 382 days ago

Also, there is another way to do this while keeping 64 bit limbs. All variables uint64_t.

    s0 += a0;
    s1 += a1;
    s2 += a2;
    s3 += a3;
    
    c0 = s0 < a0; // RISC-V `sltu`
    c1 = s1 < a1;
    c2 = s2 < a2;
    
    if (s1 == -1) goto propagate0; // executes 1 time in 18,446,744,073,709,551,616
    check_s2:
    if (s2 == -1) goto propagate1; // ditto
    
    add_carries:
    s1 += c0;
    s2 += c1;
    s3 += c2;
    goto done;
    
    propagate0: c1 = c0; goto check_s2;
    
    propagate1: c2 = c1; goto add_carries;
    
    done:

The key insight here is that unless the sum at a particular limb position is all 1s the carry out from that position DOES NOT DEPEND on the carry in to that limb position, but only on whether the original add in that position produces a carry. If the sum is all 1s the the carry out is the same as the carry in.

If you express this with a conditional branch which is overwhelmingly predicted as not taken then the code should execute each block of instructions entirely in parallel, provided that multiple conditional branches can be predicted as not-taken in the same clock cycle.

One time in 2^64 it will execute very slowly.

With 4 limb numbers on a 4-wide machine this doesn't offer an advantage over `adc` as there are also 4 code blocks. But on, say, an 8-wide machine with 8 limb numbers you're really starting to gain.

It's probably not going to help on current x86_64, but might well do on Apple's M* series, where even the M1 is 8-wide, though it might be tricky to work around the Arm ISA.

When the 8-wide RISC-V Ascalon processor from Tenstorrent hits hopefully late this year or early 2026 we will really see. And others such as Ventana, Rivos, XiangShan.

This will work even better in a wide SIMD, if you have a fast 1-lane shift (Called slideup on RISC-V).

less_less 382 days ago

Neat, but if you're using this in cryptographic code (one of the main consumers of bignums), keep in mind that secret data reaching branches is usually a side-channel risk. Sure, it's only 1 time in 2^64 on random data, but if you're depending on that, then you have to consider whether an attacker can choose data that will make it happen more often.

If you can substitute a cmov without control flow then it's probably safer, e.g. c1 |= c0 & seq(s1,-1) or so, so long as you can make sure the compiler won't turn it into a branch.

It does add a data dependency though ...

brucehoult 381 days ago

Yes, for cryptography you'd like to have constant time, but this has to be an awfully low bandwidth channel!

A `cmov` will have the same serialisation problem as `adc` but on machines without carry it might still leave you better off than the obvious `add s,a,b; sltu co,s,a; add s,s,ci; sltu t,s,ci; or co,co,t`.

phkahler 382 days ago

I think you want to write:

  if (s1 == -1)
     c1 = c0;
  if (s2 == -1)
     c2 = c1;

These can become conditional moves on x86. I've often thought RISC-V should have implemented an IF instruction instead of compare and branch. IF would cause the next instruction to be executed conditionally while not needing a flag register at the ISA level. They could have required only branch and jump to be conditional, but it turns out conditional mov, load, and store are all very useful in real code.

brucehoult 382 days ago

The problem is that, as far as I know, a conditional move is going to introduce a data dependency from c0 to c1 to c2 that is the exact thing we are trying to get rid of. The cmov is a constant time instruction, not a speculated instruction like a conditional branch.

The entire point of what I did is that the two conditional branches will be predicted not taken, so the CPU will 99.9999999999999999946% of the time not even see the `c1 = c0` and `c2 = c1` instructions that introduce the sequential dependencies.

IshKebab 382 days ago

That sounds like it would be quite a pain to implement and program. E.g. what happens if there's an interrupt between the IF and the following instruction? You need to add a CSR to read/write the conditional state, similar to the vector control CSRs (vstart etc.). Hard to see how that extra complexity would be worth it.

Modern branch predictors are very good and most branches are very predictable.

adrian_b 382 days ago

There remain many frequently-encountered cases when carry-save addition is worse than addition using add-with-carry.

Neither of the 2 multi-word addition algorithms can replace the other, both have their use cases, so ADC/SBB instructions are included in any decent ISA, because the cost of adding them is negligible. A dedicated flag register is not necessary, some ISAs store the carry/borrow flags in general-purpose registers, when used.

Not having carry is by far not the worst feature of RISC-V. Much worse is not having an integer overflow flag, because the software workaround for detecting integer overflow, which is mandatory for any program that claims to be written in a safe way, lowers the attainable performance much more than the workarounds for not having carry.

phkahler 382 days ago

>> because the software workaround for detecting integer overflow, which is mandatory for any program that claims to be written in a safe way, lowers the attainable performance much more than the workarounds for not having carry

That's absurd. A better way is to ensure that your algorithms don't overflow. Detecting an overflow just means your code has to STOP which is usually not safe. It'd be insane to have conditionally executed code trying to figure out how to handle an overflow anywhere in code. Another problem is that flags are not even accessible from any language higher level then ASM. From a C perspective there are no flags.

dzaima 382 days ago

While there is no direct access to flags in standard C, you can nevertheless on gcc and clang compile with -ftrapv and get your signed integer arithmetic be overflow-checked. Or you can use __builtin_add_overflow & co and get access to the overflow flags that way. Rust debug builds trap on signed and unsigned integer overflow, and you can make release builds do so too.

While it'd be nice to have a formal proof that every single `a+b`, `a-b`, `a*b` in every codebase doesn't overflow, I'm sure you understand that that is rather impractical. (and really, it'd be nice! I've thought about having some compile-time-bounded-size integers where each addition increases the size, but multiplication is much less suitable for that, and it also means you can't have a loop adding to an accumulator. It's a rather non-trivial problem really - you might think that it'd be fine to have a loop over a list of objects and sum their sizes, but that can relatively easily overflow if the list references the same massive object many times, so can't even really abstract that)

dzaima 381 days ago

Oh, also, C23 added standard ckd_add & ckd_sub & ckd_mul for getting a boolean of whether the operation overflows (i.e. standard equivalent to __builtin_add_overflow)!

pjc50 382 days ago

This is all downstream of C omitting the carry flag, which means in practice it's very rarely used for the purpose of a carry.

immibis 382 days ago

C does, however, now have _BitInt

phkahler 382 days ago

Ugh, what a terrible thing to add to C.

immibis 381 days ago

Eh. It's probably meant to be typedefed away. The number of bits has to be a constant, so it's not like it's a really new language feature, just a different way to write uintXX_t that's not constrained to fixed values of X.

If you ask your compiler for 1-megabyte integers, that's on you.

NooneAtAll3 382 days ago

ha, I'm not the only one to think "so what's all the risc5 gmp fuss was about, if carry flag is slow anyway?"

brucehoult 382 days ago

Right.

Even at that time in 2021 I argued that serialising through a carry flag is limiting on wide machines, but there was very little RISC-V hardware available at the time and also GMP was not yet ported to RISC-V.

That has changed a bit now, and almost two months ago I tried the GMP project's own gmpbench on a few RISC-V boards.

I found that when comparing similar µarch at similar clock speed, in dual-issue in-order SiFive's U74 is very comparable to Arm's A53, and in small 3-wide OoO SiFive's P550 is significantly better than Arm's A72.

And that's not even using the kind of technique discussed in this post, but the full multi-instruction carry flag emulation criticised by Granlund.

https://www.reddit.com/r/RISCV/comments/1jsnbdr/gnu_mp_bignu...

It's going to be very interesting when the 8-wide OoO RISC-V cores come out, probably starting with Tenstorrent's Ascalon core which they expect to tape out in Q3 and they have said they want to get into as many hands as possible to accelerate RISC-V development, including in laptops, not only in servers or the like.