Hacker News new | ask | show | jobs
by jcalvinowens 779 days ago
One line of inline assembler makes the bignum school multiplication trivial: https://github.com/jcalvinowens/toy-rsa/blob/master/bfi.c#L4...

If I could go back in time and change one thing about the C language, I would add some notion of expanding multiplication. It's a shame Rust doesn't have it either. Hardware support is everywhere: hell, the Cortex M0 doesn't do division, but it has an expanding multiply!

This is from a very ugly little toy implementation of RSA I wrote a long time ago: https://github.com/jcalvinowens/toy-rsa

I found I could get away with the Fermat test, because the algorithm doesn't work if the primes aren't actually prime: the Fermat test is fast, and an encrypt/decrypt eliminates the extremely minuscule chance either prime is a fermat liar.

But I don't know if it can be proven there do not exist non-prime P/Q values which produce an RSA keypair which can successfully encrypt/decrypt messages. I'm sure this isn't kosher for a real implementation, but I've never found an answer.

6 comments

Curiously, C actually has bignums. Now. In C23, they added a _BitInt(N) type (e.g., "_BitInt(1024)" for a 128-byte type).

The compiler support for that is limited, though. To let N be >128 in Clang, -fexperimental-max-bitint-width=N flag can be provided. If N>128 and _BitInt(N) is divided by something, the compiler will just crash, but +, -, * all work as expected.

Neat. Maybe in ten years I'll actually be able rely on it existing in code I write!
With all the things the C standard lacks... I would not have expected this to be included!
> If I could go back in time and change one thing about the C language, I would add some notion of expanding multiplication.

Zig makes this relatively easy: it has a @mulWithOverflow intrinsic, which returns an overflow bit with the result, and it has integers up to (u|i)65535. So depending on what you're doing, you can either detect overflow and then upcast, or upcast first and then optionally truncate.

It also has saturating multiply as a separate operator *|, or wrapping with *%, for when those are the semantics you want. Otherwise overflow is safety-checked undefined behavior, which will panic in Debug and ReleaseSafe build modes.

At least on x86, there's no real point in detecting truncation before upcasting, since expanding multiplication yields both the high and low halves in a single instruction. At best, you'd be adding branches that the processor can insert just as well itself. (Unless we're talking about things like 128x128-bit multiplication, which would be a different story; it would likely depend on your expected input distribution.)

As for returning an overflow bit, Rust has had this since forever with its overflowing_OP() methods, and C23 has recently added an <stdckdint.h> header with a bunch of ckd_OP() macros that return an overflow bit.

> But I don't know if it can be proven there do not exist non-prime P/Q values which produce an RSA keypair which can successfully encrypt/decrypt messages. I'm sure this isn't kosher for a real implementation, but I've never found an answer.

If p and q are coprime Carmichael numbers then RSA will still successfully encrypt and decrypt messages, though this will be less secure as p*q will have smaller prime factors and thus be easier to factor.

I believe that both in most C compilers and in Rust, casting to a bigger type and multiplying does produce the exact opcode.
Of course. But that's ugly and hard to follow compared to the C implementation I linked (at least, in my opinion).

And int128 support isn't ubiquitous.

Even so, if you wrap it in a function and make sure to document what's happening, I would argue that a version without asm() is preferable. It gives the compiler more leeway for optimization and is easier to read for someone who isn't well-versed in GCC's weird asm syntax.

See the generated code for these two alternative implementations: https://godbolt.org/z/3vno6G46j

That's actually just a bug: the x86 asm constraints were needlessly forcing the compiler to use rdx as the input register for the multiply instruction.

With that fixed, it's the same: https://godbolt.org/z/M571P371K

But after staring at this for a little while, I realized simply reversing the order of the fields in the dword struct will make the result from the multiply instruction already match the way 128-bit structs are returned in registers under the x86_64 calling convention! This is quite a bit better: https://godbolt.org/z/MbfG63vej

Also worth noting the asm makes nicer code on arm64: https://godbolt.org/z/7eas6s9vK

https://github.com/jcalvinowens/toy-rsa/commit/59ef9ea905dbd... https://github.com/jcalvinowens/toy-rsa/commit/ceaa8a4dd0834...

Yes I also noticed the bug in the constraints and came to the same conclusion that changing them yields the same code for both implementations. But I figured that the fact that a function with a single asm instruction has a bug kind of supports my point. :)

As for the codegen on ARM, I don't think your asm implementation is correct there either. As far as I can tell the UMULL instruction only cares about the least significant 32 bits in the source registers, regardless if you're on a 64-bit CPU: https://developer.arm.com/documentation/ddi0602/2024-03/Base...

Very much disagree with your point about constraints. A bug existing in 10+ year old code that has never really been tested and run by four people doesn't support any point. In real life one actually checks these things, it's not that complex :)

Case in point: counting the f's in your constants took me longer than finding that constraint bug did. You would argue ULLONG_MAX would fix that, I suppose.

You're right, that's the 32-bit arm instruction, doh. In my defense, this code was written before 64-bit ARM existed!

On 64-bit hosts using the full widening multiplier requires that you use 128 bit integers-- which aren't portable. :( (in particular, MSVC doesn't have it last I checked). It's the obvious thing to do where they exist however.
Yeah, a lack of guaranteed existence of uint128_t is problematic for C. (But __int128 is reasonably portable!) Rust always has u128/i128 in comparison.
The original Pretty Good Privacy (PGP) by Philip Zimmermann in 1994 used only a sieve that divided by all known 16-bit primes (this table was produced using the Sieve of Eratosthenes), followed by the Fermat test.
> I would add some notion of expanding multiplication.

C type promotion is complicated enough!

Intrinsics expose this adequately, don't you think?

I want an explicit operation that gives the upper and lower half of the result separately. This is ugly, but like:

  hi, lo = val1 *** val2;
It isn't necessarily intrinsically supported on all platforms: int128 is not part of the original standard. Sometimes long long is the same width as long.

I think the concept generalizes to non-binary and non-twos-complement CPUs easily enough.