Hacker News new | ask | show | jobs
by nine_k 385 days ago
The main takeaway: doing more operations may be faster if they are largely independent, and thus can execute in parallel. Doing fewer operations may be slower if they are forced to execute serially due to data dependency.

This idea has wider applicability than operations on long integers.

4 comments

Yes. Another approach would be to use regular 64 bit chunks and speculatively execute each add with and without carry in parallel. Then select the correct variant based on carry result of less significant addition.

With double the amount of additions this allows for log(bits) propagation time (versus linear)

Wouldn’t that produce 2^n possible results to choose from, where n is the number of chunks? That seems like a lot of additional (he-he) instructions executed.
Nope. Just 2n: each chunk pair is added once without carry, and once won't carry=1.

For as long as radix=2, you either have a carry or you don't.

For a single addition, the radix is irrelevant, the carry is always zero or one: (r-1) + (r-1) = 2r - 2 < 2r.
But for a single addition you will also not save any time because you still need to do the select in series. The whole point of the radix 2^51 trick is that you can defer the normalization over several additions, but in order to do that your carry needs to be more than a single bit.
There's not just "result with carry" and "result without carry" but rather one variant of that per word of the input

... Which likely isn't that bad to code up.

What I didn't get about this: the technique shown seems to be about making sure that the ripple carry only happens once instead of N-1 times while adding N values. The carry operation is more complex, but this allows the actual addition to be parallelized.

But - you still have to split the input numbers into sets of 5 registers in the first place, right? So doesn't that need to be parallelizable somehow as well in order for this to be a net win?

That is paralelizable. Each of the 5 registers has no depence on the value of the others.
But when you split from 4 registers to 5, the bits for a given destination register may come from two different source registers.
That only means you need a couple of instructions to do each one [1]. The five output registers are each dependent on (at most) a pair of the four input registers, but not on each other.

[1] a left shift, a right shift, and an OR. Or just one 2->1 funnel shift instruction if your ISA has that e.g. arm64. And then ANDing with a 51 bit mask.

    and  out0, in0, 0x7FFFFFFFFFFFF
    extr out1, in1, in0, #51
    extr out2, in2, in1, #38
    extr out3, in3, in2, #25
    lsr  out4, in3, #12
    
    and  out1, out1, 0x7FFFFFFFFFFFF
    and  out2, out2, 0x7FFFFFFFFFFFF
    and  out3, out3, 0x7FFFFFFFFFFFF
Yep. Company called NVidia has been looking into that general idea. They seem to be getting some promising results so far, in a couple of different areas.
This rule scales up all the way to multi-node supercomputers / cloud. The overhead is negligible when you can employ 10.000 cores.
Actually the overhead crushes you when you employ 10000 cores. If the overhead of a process is 10% and the parallel part is 90%, then 2 cores will result in a run time of 55% = 10% + 90%/2 of the original time. And 10 cores will get you to 19%. And 100 cores to 10.9%. If you then buy 9900 more cores to bring it to a total of 10000, you just reduced the runtime from 10.9% to 10.009%. In other words, you increased your bill by a factor of 100 to reduce your run time by almost nothing.
You two are talking about different kinds of overhead though.
Abstractly, when any parallel system scales up large enough without cross checking or waiting between "threads", the cost of de-duplicating and merging the output will probably outweigh the advantage of producing new results in tandem. I think. That's just a hypothesis, but feels correct. With stuff like a-life distributed over lots of servers all converging on evolutionary answers to a problem, it's the collation and analysis layer that's most expensive and slow. Sharing more frequently / allowing more reliance on central historical truth slows each one down but avoids duplication and redundancy. I guess where that point is depends on what problem you're trying to solve.
Amdahl says no.