|
It's worth noting that you can solve these linear recurrences, `x(t) = a(t)x(t-1) + b(t)`, using a single parallel prefix sum where the elements are the input tuples `(a(t), b(t))`. The following operator called `combine` works. It's associative over these tuples; the final result will be the final value of the second component. Altogether, this gives you O(n) work and O(log n) span, but using just a single parallel prefix kernel, which may be more efficient in practice. combine ((a1, b1), (a2, b2)) = (a1 * a2, b1 * a2 + b2)
For example, here's a C++ implementation: https://github.com/MPLLang/parallel-ml-bench/blob/main/cpp/l... And, here's an implementation in a functional language: https://github.com/MPLLang/parallel-ml-bench/blob/main/mpl/b...I'm pretty sure this generalizes, too, to abstract multiplications and additions in any field (at first glance, seems like it should but I haven't done the formal proof yet). Anyway, it would be interesting to compare this against the solution in the arxiv paper. ---- EDIT: ah, good, this is already being discussed here: https://github.com/glassroom/heinsen_sequence/issues/1 And, for reference, I learned this algorithm from Guy Blelloch; see Sec 1.4.1 of https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf |
But... I'm not sure 1 parallel scan with 2 floating-point mults and 1 sum per step is faster than 2 parallel scans with 1 sum per step. I don't know which is better in theory or in practice. And then, wouldn't we have to think about how to sidestep all the numerical issues with the cumulative products?
Or am I missing something?