Hacker News new | ask | show | jobs
by zahlman 381 days ago
What I didn't get about this: the technique shown seems to be about making sure that the ripple carry only happens once instead of N-1 times while adding N values. The carry operation is more complex, but this allows the actual addition to be parallelized.

But - you still have to split the input numbers into sets of 5 registers in the first place, right? So doesn't that need to be parallelizable somehow as well in order for this to be a net win?

1 comments

That is paralelizable. Each of the 5 registers has no depence on the value of the others.
But when you split from 4 registers to 5, the bits for a given destination register may come from two different source registers.
That only means you need a couple of instructions to do each one [1]. The five output registers are each dependent on (at most) a pair of the four input registers, but not on each other.

[1] a left shift, a right shift, and an OR. Or just one 2->1 funnel shift instruction if your ISA has that e.g. arm64. And then ANDing with a 51 bit mask.

    and  out0, in0, 0x7FFFFFFFFFFFF
    extr out1, in1, in0, #51
    extr out2, in2, in1, #38
    extr out3, in3, in2, #25
    lsr  out4, in3, #12
    
    and  out1, out1, 0x7FFFFFFFFFFFF
    and  out2, out2, 0x7FFFFFFFFFFFF
    and  out3, out3, 0x7FFFFFFFFFFFF