| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by nine_k 385 days ago
	The main takeaway: doing more operations may be faster if they are largely independent, and thus can execute in parallel. Doing fewer operations may be slower if they are forced to execute serially due to data dependency. This idea has wider applicability than operations on long integers.

4 comments

repelsteeltje 385 days ago

Yes. Another approach would be to use regular 64 bit chunks and speculatively execute each add with and without carry in parallel. Then select the correct variant based on carry result of less significant addition.

With double the amount of additions this allows for log(bits) propagation time (versus linear)

link

volemo 385 days ago

Wouldn’t that produce 2^n possible results to choose from, where n is the number of chunks? That seems like a lot of additional (he-he) instructions executed.

link

repelsteeltje 385 days ago

Nope. Just 2n: each chunk pair is added once without carry, and once won't carry=1.

For as long as radix=2, you either have a carry or you don't.

link

mananaysiempre 385 days ago

For a single addition, the radix is irrelevant, the carry is always zero or one: (r-1) + (r-1) = 2r - 2 < 2r.

link

account42 382 days ago

But for a single addition you will also not save any time because you still need to do the select in series. The whole point of the radix 2^51 trick is that you can defer the normalization over several additions, but in order to do that your carry needs to be more than a single bit.

link

brucehoult 385 days ago

Or see https://news.ycombinator.com/item?id=44133169

link

dgoldstein0 385 days ago

There's not just "result with carry" and "result without carry" but rather one variant of that per word of the input

... Which likely isn't that bad to code up.

link

zahlman 385 days ago

What I didn't get about this: the technique shown seems to be about making sure that the ripple carry only happens once instead of N-1 times while adding N values. The carry operation is more complex, but this allows the actual addition to be parallelized.

But - you still have to split the input numbers into sets of 5 registers in the first place, right? So doesn't that need to be parallelizable somehow as well in order for this to be a net win?

link

adgjlsfhk1 385 days ago

That is paralelizable. Each of the 5 registers has no depence on the value of the others.

link

zahlman 385 days ago

But when you split from 4 registers to 5, the bits for a given destination register may come from two different source registers.

link

brucehoult 384 days ago

That only means you need a couple of instructions to do each one [1]. The five output registers are each dependent on (at most) a pair of the four input registers, but not on each other.

[1] a left shift, a right shift, and an OR. Or just one 2->1 funnel shift instruction if your ISA has that e.g. arm64. And then ANDing with a 51 bit mask.

    and  out0, in0, 0x7FFFFFFFFFFFF
    extr out1, in1, in0, #51
    extr out2, in2, in1, #38
    extr out3, in3, in2, #25
    lsr  out4, in3, #12
    
    and  out1, out1, 0x7FFFFFFFFFFFF
    and  out2, out2, 0x7FFFFFFFFFFFF
    and  out3, out3, 0x7FFFFFFFFFFFF

link

CamperBob2 385 days ago

Yep. Company called NVidia has been looking into that general idea. They seem to be getting some promising results so far, in a couple of different areas.

link

rollcat 385 days ago

This rule scales up all the way to multi-node supercomputers / cloud. The overhead is negligible when you can employ 10.000 cores.

link

credit_guy 385 days ago

Actually the overhead crushes you when you employ 10000 cores. If the overhead of a process is 10% and the parallel part is 90%, then 2 cores will result in a run time of 55% = 10% + 90%/2 of the original time. And 10 cores will get you to 19%. And 100 cores to 10.9%. If you then buy 9900 more cores to bring it to a total of 10000, you just reduced the runtime from 10.9% to 10.009%. In other words, you increased your bill by a factor of 100 to reduce your run time by almost nothing.

link

volemo 385 days ago

You two are talking about different kinds of overhead though.

link

noduerme 385 days ago

Abstractly, when any parallel system scales up large enough without cross checking or waiting between "threads", the cost of de-duplicating and merging the output will probably outweigh the advantage of producing new results in tandem. I think. That's just a hypothesis, but feels correct. With stuff like a-life distributed over lots of servers all converging on evolutionary answers to a problem, it's the collation and analysis layer that's most expensive and slow. Sharing more frequently / allowing more reliance on central historical truth slows each one down but avoids duplication and redundancy. I guess where that point is depends on what problem you're trying to solve.

link

hinkley 385 days ago

Amdahl says no.

link