by brucehoult 1730 days ago
I completely agree with you on 0/1 vs 0/-1 for SLT and you can easily find me saying so. For example: https://lists.riscv.org/g/tech-bitmanip/message/496

It was an error, though a rather minor one, to follow the C language so closely. I can and have pointed out other minor mistakes in RISC-V in the past -- none of them serious enough to abandon it and start over.

I'll quote myself from there, below.

32 bits is not such a huge instruction. ARM decided it's good enough for their new(ish) 64 bit ISA, and it's about the average size of x86_64 instructions.

Original RISC-V (v1.0) has only and exactly the instructions needed to implement C. That's enough for many or most applications, and will be available as a support option forever. The upcoming RVA22 specification for Applications Processors, which will be ratified before the end of the year, includes an SVE-like vector extension and also Bit Manipulation extensions (along with many others). The Zbb (Basic bit-manipulation) extension includes cpop along with clz and ctz and rotate. There is also andn, orn, xnor, max, maxu, min, minu, sext.b, sext.h, zext.h, and rev8 (reverse bytes in a register). Plus a unique instruction orc.b which replaces any non-zero byte in the source operand with all ones. There is also scalar crypto and cache manipulation (prefetch, flush etc).
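
To make the orc.b description concrete, here is a rough C model of what it computes on a 64-bit register. The function name `orc_b` and the fixed 64-bit width are assumptions for illustration; the real instruction works on whatever XLEN the core has.

```c
#include <stdint.h>

/* Sketch of orc.b on RV64: every non-zero byte becomes 0xFF,
   every zero byte stays 0x00. */
uint64_t orc_b(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = (x >> (8 * i)) & 0xFF;
        if (byte != 0)
            r |= (uint64_t)0xFF << (8 * i);
    }
    return r;
}
```

One plausible use is spotting a zero byte in a word: the result is all ones exactly when no byte of the input was zero.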

Perhaps RVA22 is your hypothetical Risc-6.

-----

There are five reasons you might use SLT / SLTU, in (I think) descending order of how common they are, and the implications had -1 been used instead of 1:

1) to generate a zero/non-zero value. No difference.

2) to generate a mask. Using 0 and -1 is better, saving a NEG or a subtract 1, depending on whether you reverse the condition or not.

3) to generate a value that can be AND / OR / XOR etc with other such values. No difference.

4) to assign to a canonical C/C++ true/false, or mix with them using AND / OR / XOR. Worse -- have to do an ANDI #1 before using the final result.

5) to generate a canonical C true/false and add or subtract it from something. No difference. Just flip add to subtract or vice versa.
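
A small C sketch of cases 2 and 5, assuming the comparison yields 0 or 1 as SLT / SLTU do today. The function names are made up for illustration.

```c
#include <stdint.h>

/* Case 2: a 0/1 result needs one extra negate to become a 0/all-ones mask. */
uint32_t select_min_u32(uint32_t a, uint32_t b) {
    uint32_t lt = (a < b);        /* 0 or 1, like SLTU */
    uint32_t mask = 0u - lt;      /* the extra NEG: 0 or 0xFFFFFFFF */
    return (a & mask) | (b & ~mask);
}

/* Case 5: the 0/1 result is added directly. With a 0/-1 result the
   add would simply become a subtract. */
unsigned count_below(const uint32_t *v, unsigned n, uint32_t limit) {
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++)
        count += (v[i] < limit);
    return count;
}
```

With a 0/-1 convention the `0u - lt` step in the first function would disappear, which is the saving described in case 2.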

Interestingly, a time when you do want 0 or 1 is the examples in the original superoptimiser paper from 1987.

https://web.stanford.edu/class/cs343/resources/superoptimize...

They first considered the function:

  int signum (int x) {
    if (x > 0) return 1;
    else if (x < 0) return -1;
    else return 0;
  }
They showed the superoptimiser finding the following unexpected 68020 sequence, making use of the carry flag:

  (x in d0)
  add.l d0,d0   ;add d0 to itself
  subx.l d1,d1  ;subtract (d1 + Carry) from d1
  negx.l d0     ;put (0 - d0 - Carry) into d0
  addx.l d1,d1  ;add (d1 + Carry) to d1
  (signum(x) in d1) (4 instructions)
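
For anyone who wants to convince themselves that the four-instruction sequence works, here is a rough C model of it. It is only a sketch: it tracks the two data registers and the extend/carry flag, the function name `signum_68k` is made up, and the starting value of d1 is arbitrary on purpose, since the sequence never depends on it.

```c
#include <stdint.h>

int32_t signum_68k(int32_t x) {
    uint32_t d0 = (uint32_t)x;
    uint32_t d1 = 0xDEADBEEFu;   /* arbitrary: subx d1,d1 ignores the old value */
    uint32_t xf;                 /* models the X (extend/carry) flag */

    /* add.l d0,d0: carry out is the old sign bit of x */
    xf = d0 >> 31;
    d0 = d0 + d0;

    /* subx.l d1,d1: d1 = d1 - d1 - X, so 0 or all ones; the borrow equals X */
    d1 = d1 - d1 - xf;

    /* negx.l d0: d0 = 0 - d0 - X; borrows unless both d0 and X are zero */
    uint32_t borrow = (uint32_t)(d0 != 0) | xf;
    d0 = 0u - d0 - xf;
    xf = borrow;

    /* addx.l d1,d1: d1 = d1 + d1 + X */
    d1 = d1 + d1 + xf;

    /* avoid an implementation-defined unsigned-to-signed conversion */
    return (d1 == UINT32_MAX) ? -1 : (int32_t)d1;
}
```

For a positive x the final add computes 0 + 0 + 1, for a negative x it computes -1 + -1 + 1, and for zero every step stays at zero.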
This is much more straightforward on RISC-V:

  (x in a0)
  slt a1,a0,zero  # a1 = 1 if x is negative, 0 if 0 or positive
  slt a0,zero,a0 # a0 = 1 if x is positive, 0 if 0 or negative
  sub a0,a0,a1 # 1-0 = 1 if positive, 0-0 = 0 if zero, 0-1 = -1 if negative
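
The same three steps written out in C, as a sketch. The helper `slt` just models the 0/1 result of the instruction; both function names are made up here.

```c
#include <stdint.h>

/* Models RISC-V SLT: 1 if a < b (signed), else 0. */
int32_t slt(int32_t a, int32_t b) { return a < b; }

int32_t signum_slt(int32_t x) {
    int32_t neg = slt(x, 0);   /* slt a1,a0,zero */
    int32_t pos = slt(0, x);   /* slt a0,zero,a0 */
    return pos - neg;          /* sub a0,a0,a1   */
}
```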
-----
3 comments

> This is much more straightforward on RISC-V

AIUI, if SLT returns 0 or -1 you can then reverse the arguments to SUB and get a correct result. If you return the result in a1 you can also keep the 2-operand compressed form of SUB, so there's effectively no difference. Equivalently, you can keep the SUB insn unchanged (thus using a 2-operand form to return in a0) while flipping the previous SLT insns: SLT a1, zero, a0; SLT a0, a0, zero.
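
Both rewrites can be checked with a quick C model. `slt_m1` is a hypothetical SLT that returns 0 or -1; it is not a real RISC-V instruction, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical SLT variant: -1 if a < b (signed), else 0. */
int32_t slt_m1(int32_t a, int32_t b) { return -(int32_t)(a < b); }

/* Option 1: same two SLTs, operands of SUB reversed. */
int32_t signum_m1_swapped_sub(int32_t x) {
    int32_t neg = slt_m1(x, 0);   /* -1 if x < 0 */
    int32_t pos = slt_m1(0, x);   /* -1 if x > 0 */
    return neg - pos;
}

/* Option 2: SUB unchanged, the two SLTs exchanged. */
int32_t signum_m1_swapped_slt(int32_t x) {
    int32_t a1 = slt_m1(0, x);    /* SLT a1, zero, a0 */
    int32_t a0 = slt_m1(x, 0);    /* SLT a0, a0, zero */
    return a0 - a1;               /* SUB a0, a0, a1   */
}
```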

True, it would be better if they had defined RISC-V's C and C++ ABI to make 'true' physically equal to -1, negating or adding 1 to it when actually necessary to treat it as an int value. That would be rare.

The very late addition of the reified B extensions (and others) will be a continuing problem, as builds will not be able to count on them having been implemented. (Trap emulation would be much worse than useless.) The lack of rotate operations in the base instruction set is a problem for implementing modern encryption systems. On embedded chips likely to appear in routers and switches, "extensions" such as the Bs are especially likely to be omitted.

It would not be necessary to abandon the work on RISC-V to do a Risc-6. Most of the work done could carry over.

No more of a problem than in any other ISA that has seen incremental improvements -- which is all of them.

Modern x86_64 OSes such as Windows and Linux run on everything back to the original Opteron and Athlon 64 from 2003, which don't have POPCNT and LZCNT. Those were implemented by AMD starting with Bobcat and Bulldozer in 2011. Intel added POPCNT in Nehalem in 2008 and LZCNT in Haswell in 2013.

Aarch64 got both from the start, but there are other things added in ARMv8.1-A through ARMv8.8-A (and ARMv9) which are presumably also useful to certain software.

Embedded chips used in routers and switches will take exactly the extensions useful to them and none that aren't. If Zbb is useful to them then they will certainly include it -- that's why the extensions are specified so finely with three non-overlapping extensions for BitManip being defined this year. Applications processors running shrink-wrapped OSes are required to take all the extensions in RVA22 (or none). The embedded world picks and chooses what they want.

In other words, as much of a problem as in any ISA that has seen incremental improvements. We are supposed to learn from the mistakes of our forbears, not repeat them verbatim.

Chips used in routers and switches will be exactly what is cheapest, just as now, regardless of what performs best or adequately. Thus, they will lack B extensions, howsoever useful they might have been.

> In other words, as much of a problem as in any ISA that has seen incremental improvements.

That's way overstated. RISC-V is still an amazingly clean and elegant design, placing extreme focus on technical excellence and on making effective use of limited insn encoding space. (Just look at how cautious the ratification of B and V has been - some of that was due to wanting to maximize feasible overlap between B and other exts, so as to avoid wasting even the smallest fractions of insn space). Tiny warts like SLT returning 0/1 as opposed to 0/-1 don't change that in any way.

Way understated, rather. If you have AVX, you know you also have POPCNT and everything else older than AVX. Having thing A on RISC-V tells you nothing about whether you have thing B or C, or vice versa. The set of possible targets is exponential in the number of extensions, rather than strictly linear in the number of additions, as seen in existing chips.

"Tiny warts" reveal mindset: how aware are the designers of the consequences of their choices? Each is a clue. Lack of rotate and popcount instructions in the core instruction set provides a clue. Expectation that five-instruction sequences can be fused might be another. (When your instructions are already 4 bytes or more, each, five means at least 20 bytes for a single primitive operation.) The extremely complicated extensions landscape is another.

Rotate and popcount are very specialised instructions. The vast majority of software doesn't use them at all, or uses them so infrequently that a software implementation is fine.
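
For reference, the kind of software implementation I mean looks roughly like this in C. These are the usual portable idioms, not code from any particular library, and the names are made up.

```c
#include <stdint.h>

/* Rotate left by n bits; masking keeps both shift counts in 0..31. */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Population count by summing bits in parallel within the word. */
unsigned popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}
```

Compilers commonly recognise the rotate idiom and emit a single rotate instruction where the target has one, so the source does not need to change between targets.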

You are confusing embedded applications, which have huge flexibility with RISC-V, and standard operating systems with packaged software.

For the next few years (5?) standard operating systems have to support exactly two choices:

- RV64GC

- RVA22

RVA22 includes all the bit manipulation instructions, vectors, cache management, scalar crypto, and some other stuff. You can't pick and choose -- you have to support it all.

If you are making an embedded appliance on the other hand you can pick and choose exactly what extensions you need (a huge number of combinations, as you say), specify a core with exactly those extensions, build a chip around that with the other IP blocks you need, and tell your compiler which extensions you have. You compile all your software yourself, whether bare metal, using an RTOS, or a minimal Linux such as Buildroot or Yocto. There is zero confusion because you know what you have and you have what you need -- no more and no less.
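
As a sketch of what "tell your compiler" means in practice: the same C source can use the extension when it is enabled and fall back otherwise. This assumes the toolchain defines a `__riscv_zbb` macro when Zbb is selected (for example through something like `-march=rv64gc_zbb`) and provides the GCC/Clang builtin `__builtin_ctz`; treat both as assumptions to check against your own toolchain. The function name is made up.

```c
#include <stdint.h>

/* Count trailing zeros; returns 32 for x == 0. */
unsigned ctz32(uint32_t x) {
    if (x == 0)
        return 32;   /* __builtin_ctz is undefined for 0 */
#if defined(__riscv_zbb) && (defined(__GNUC__) || defined(__clang__))
    /* Assumed: with Zbb enabled the compiler can emit a ctz instruction here. */
    return (unsigned)__builtin_ctz(x);
#else
    /* Portable fallback for cores built without the extension. */
    unsigned n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
#endif
}
```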

No one who knows what they are talking about is talking about fusing five-instruction sequences. That's a total red herring.

I have seen the mask-generation use case only once, for a constant-time ternary operator in cryptography, but I use booleans regularly (your use case 4). The cryptography case is the X25519 algorithm, where the condition controlling the conditional swap is a bit extracted from the private key, and the algorithm uses 255 bits of the key sequentially.
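
Roughly, the conditional swap looks like this in C. It is a minimal sketch: the limb layout, the `limbs` parameter and the function name are assumptions for illustration, not taken from any specific X25519 implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Swaps a and b when bit is 1, leaves them alone when bit is 0,
   without branching on the secret bit. */
void cswap(uint64_t *a, uint64_t *b, size_t limbs, uint64_t bit) {
    uint64_t mask = (uint64_t)0 - (bit & 1);   /* 0 or all ones */
    for (size_t i = 0; i < limbs; i++) {
        uint64_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}
```

The negate that builds `mask` from the 0/1 key bit is the step a 0/-1 convention would save, but here the bit comes from a shift and AND on the key rather than from a comparison.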