|
|
|
|
|
by Joker_vD
132 days ago
|
|
And having a CPU with "128-by-64" or "64-by-32" division instruction (such as x86/x64) simplifies the algorithm even further. Inputs: uint[128] N, uint[128] D
Outputs: uint[128] Q, uint[128] R
such that Q ≤ N, R < D, and N = Q * D + R
Uses: (uint[64] quot, uint[64] rem) hw_divmod(uint[128] dividend, uint[64] divisor)
if D.high = 0:
# long division; quotient can be 128-bit wide, but remainder will fit in 64 bits
uint[192] N' := N ≪ lzcnt(D.low)
uint[64] D' := D ≪ lzcnt(D.low)
uint[64] Q', uint[64] R' := hw_divmod(N'.high ≪ 64 + N'.mid, D')
uint[64] Q'', uint[64] R'' := hw_divmod(R' ≪ 64 + N'.low, D')
Q := Q' ≪ 64 + Q''
R := R'' ≫ lzcnt(D.low)
else:
# Quotient will fit in 64 bits, but remainder could be 128-bits wide
uint[192] N' := N ≪ lzcnt(D.high)
uint[128] D' := D ≪ lzcnt(D.high)
uint[64] Q', uint[64] R' := hw_divmod(N'.high ≪ 64 + N'.mid, D'.high)
uint[128] R'' := R' ≪ 64 + N'.low
uint[128] T := Q' * D'.low
if R'' < T:
Q := Q' - 1
R := D' + R'' - T
else:
Q := Q'
R := R'' - T
R := R ≫ lzcnt(D.high)
|
|