Hacker News new | ask | show | jobs
by cs702 931 days ago
Yes! That's pretty much how I would have thought about this with my puny little brain.

But... I'm not sure 1 parallel scan with 2 floating-point mults and 1 sum per step is faster than 2 parallel scans with 1 sum per step. I don't know which is better in theory or in practice. And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

Or am I missing something?

2 comments

On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.

Memory pressure is even worse on GPUs. I did some work to generalize Blelloch to 2D parallel prefix sums for integral image computation back in 2008 [1], and the number of memory accesses really dominates. On a GPU, for sufficiently small problems the number of passes matters more, and it is worth using a simpler, non-work-efficient algorithm to reduce setup overheads.

[1] https://people.xiph.org/~tterribe/pubs/gpusurf.pdf Section III.A

Thank you, this is helpful... although my initial thought was this may be useful for a rather different type of application: These new AI models, "linear RNNs," that have many layers, each layer processing large batches of sequences in parallel, each sequence with potentially up to millions of tokens, each token with thousands of features. Definitely not small-scale. Hard to reason about, at least for me.
> My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

Thank you, that's helpful. Your intuition may be right, but I'm not sure either. Too hard to reason about, at least for me. Maybe the thing to do is test both... unfortunately that would involve the hassle of writing code for executing the single scan efficiently on GPUs.

> And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

I think so. And also potential issues with all the sums. :) (see: "compensated summation") god I love this shit. :)