

by shwestrick 930 days ago

It's worth noting that you can solve these linear recurrences, `x(t) = a(t)x(t-1) + b(t)`, using a single parallel prefix sum where the elements are the input tuples `(a(t), b(t))`.

The following operator, `combine`, works. It's associative over these tuples, and the final result is the second component of the last tuple (assuming x(0) = 0; otherwise x(n) is the first component times x(0), plus the second component). Altogether, this gives you O(n) work and O(log n) span using just a single parallel prefix kernel, which may be more efficient in practice.

    combine ((a1, b1), (a2, b2))  =  (a1 * a2, b1 * a2 + b2)
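To make the operator concrete, here is a minimal Python sketch. It runs the scan sequentially with `itertools.accumulate`, so it only demonstrates correctness, not parallel speed. The names `solve_by_scan`, `solve_directly`, and the `x0` parameter are illustrative assumptions, not taken from the linked implementations.

```python
from itertools import accumulate

def combine(p, q):
    """The associative operator from the comment above."""
    a1, b1 = p
    a2, b2 = q
    return (a1 * a2, b1 * a2 + b2)

def solve_by_scan(a, b, x0=0):
    """Inclusive scan over the (a(t), b(t)) tuples.

    Each prefix (A, B) represents the affine map x0 -> A * x0 + B,
    so x(t) = A * x0 + B. With x0 = 0 this is just B.
    """
    prefixes = accumulate(zip(a, b), combine)
    return [A * x0 + B for A, B in prefixes]

def solve_directly(a, b, x0=0):
    """Reference: evaluate x(t) = a(t) * x(t-1) + b(t) step by step."""
    xs, x = [], x0
    for at, bt in zip(a, b):
        x = at * x + bt
        xs.append(x)
    return xs
```

Because `combine` is associative, a parallel scan may group the tuples in any order and still produce the same prefixes that `accumulate` produces here.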

For example, here's a C++ implementation: https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/l... And, here's an implementation in a functional language: https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/b...

I'm pretty sure this also generalizes to abstract multiplication and addition in any field (at first glance it seems like it should, but I haven't done the formal proof yet).
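For what it's worth, the associativity check is short. Writing p_i = (a_i, b_i), a sketch of the two groupings:

```latex
\begin{aligned}
\mathrm{combine}(\mathrm{combine}(p_1, p_2), p_3)
  &= \mathrm{combine}\big((a_1 a_2,\; b_1 a_2 + b_2),\; (a_3, b_3)\big) \\
  &= \big(a_1 a_2 a_3,\; (b_1 a_2 + b_2)\, a_3 + b_3\big) \\
  &= \big(a_1 a_2 a_3,\; b_1 a_2 a_3 + b_2 a_3 + b_3\big), \\[4pt]
\mathrm{combine}(p_1, \mathrm{combine}(p_2, p_3))
  &= \mathrm{combine}\big((a_1, b_1),\; (a_2 a_3,\; b_2 a_3 + b_3)\big) \\
  &= \big(a_1 a_2 a_3,\; b_1 a_2 a_3 + b_2 a_3 + b_3\big).
\end{aligned}
```

Both groupings agree, and the steps use only associativity of multiplication and distributivity of multiplication over addition, which every field provides. As far as this derivation goes, neither commutativity nor multiplicative inverses appear to be needed.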

Anyway, it would be interesting to compare this against the solution in the arxiv paper.

----

EDIT: ah, good, this is already being discussed here: https://github.com/glassroom/heinsen_sequence/issues/1

And, for reference, I learned this algorithm from Guy Blelloch; see Sec 1.4.1 of https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf


cs702 930 days ago

Yes! That's pretty much how I would have thought about this with my puny little brain.

But... I'm not sure 1 parallel scan with 2 floating-point mults and 1 sum per step is faster than 2 parallel scans with 1 sum per step. I don't know which is better in theory or in practice. And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

Or am I missing something?
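For comparison, here is a rough Python sketch of one way to solve the same recurrence with two cumulative sums instead of one tuple scan. It is an assumption about what the two-scan alternative looks like, not the paper's exact formulation: it requires every a(t) to be strictly positive, it keeps the second sum in the linear domain for brevity, and the function name is hypothetical. It relies on the identity x(t) = A(t) * (x0 + sum over s <= t of b(s) / A(s)), where A(t) is the cumulative product of the a values.

```python
import math
from itertools import accumulate

def solve_two_scans(a, b, x0=0.0):
    """Two cumulative sums; assumes every a(t) > 0."""
    # Scan 1: cumulative sum of log a(t), i.e. log A(t).
    log_A = list(accumulate(math.log(at) for at in a))
    # Scan 2: cumulative sum of b(t) / A(t).
    scaled = accumulate(bt * math.exp(-la) for bt, la in zip(b, log_A))
    # Undo the scaling: x(t) = A(t) * (x0 + running sum).
    return [math.exp(la) * (x0 + s) for la, s in zip(log_A, scaled)]
```

The division by A(t) hidden in the second scan is where the numerical trouble with cumulative products shows up: if A(t) grows or shrinks quickly, `math.exp(-la)` can overflow or underflow.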


shwestrick 929 days ago

On modern multicore hardware this will be memory-bound; the amount of computation per byte is pretty small (just a few arithmetic instructions on average). My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

And yes, definitely, the numerical accuracy thing could be a problem. I suspect it wouldn't be too difficult to work around, but I can't say for sure off the top of my head.


derf_ 929 days ago

Memory pressure is even worse on GPUs. I did some work to generalize Blelloch to 2D parallel prefix sums for integral image computation back in 2008 [1], and the number of memory accesses really dominates. On a GPU, for sufficiently small problems the number of passes matters more, and it is worth using a simpler, non-work-efficient algorithm to reduce setup overheads.
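As an illustration of a simpler, non-work-efficient scan, here is a Python sketch in the style of the Hillis-Steele algorithm, applied to the `combine` operator from the top comment. It is a sequential simulation: each pass of the `while` loop stands in for one parallel step, and the function name is an assumption. It does O(n log n) total work in about log2(n) passes, versus O(n) work for a Blelloch-style scan.

```python
def combine(p, q):
    """Associative operator for x(t) = a(t) * x(t-1) + b(t)."""
    a1, b1 = p
    a2, b2 = q
    return (a1 * a2, b1 * a2 + b2)

def hillis_steele_scan(items, op):
    """Inclusive scan; each while-iteration models one parallel pass."""
    out = list(items)
    step = 1
    while step < len(out):
        prev = out
        # Every position i >= step combines with the element `step` places
        # to its left. The earlier element goes on the left because `op`
        # need not be commutative.
        out = [
            prev[i] if i < step else op(prev[i - step], prev[i])
            for i in range(len(prev))
        ]
        step *= 2
    return out
```

On a real GPU each list comprehension would be one kernel launch (or one synchronized step within a block), reading from one buffer and writing to another.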

[1] https://people.xiph.org/~tterribe/pubs/gpusurf.pdf Section III.A


cs702 929 days ago

Thank you, this is helpful... although my initial thought was this may be useful for a rather different type of application: These new AI models, "linear RNNs," that have many layers, each layer processing large batches of sequences in parallel, each sequence with potentially up to millions of tokens, each token with thousands of features. Definitely not small-scale. Hard to reason about, at least for me.


cs702 929 days ago

> My intuition is that the single scan will be faster because it requires a much smaller number of cache misses.

Thank you, that's helpful. Your intuition may be right, but I'm not sure either. Too hard to reason about, at least for me. Maybe the thing to do is test both... unfortunately that would involve the hassle of writing code for executing the single scan efficiently on GPUs.


loxias 930 days ago

> And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?

I think so. And also potential issues with all the sums. :) (see: "compensated summation") god I love this shit. :)
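For anyone who hasn't seen compensated summation, here is a minimal sequential sketch of the classic Kahan variant in Python. It only shows the idea on a plain sum; how best to fold the compensation into a parallel scan is not addressed here. The function names are illustrative.

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point sum."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan compensated summation."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # apply the correction to the next term
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total
```

A quick way to see the difference is to add many tiny terms to 1.0, for example `[1.0] + [1e-16] * 1000`, and compare both results against `math.fsum`, which returns a correctly rounded sum.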


leiroigh 930 days ago

This is very very well known. Cf https://en.wikipedia.org/wiki/Affine_group

I don't see how people should glorify this with the word "algorithm". It is a trivial undergrad homework exercise, once you give the hint "use parallel reduce / fold / prefixsum".

This may involve more interesting tradeoffs if you deal with large or sparse matrices or matrix-free operators.


dang 930 days ago

If you know more than others do, that's great, but instead of posting putdowns, please share some of what you know so the rest of us can learn.

The trouble with comments like this is that they degrade discussion and put others down without really teaching us anything.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...

https://news.ycombinator.com/newsguidelines.html
