by meisanother
1369 days ago
Well, just got it. Thanks for the reference! A bit sad that 1974 papers are still behind an IEEE paywall...
Edit: Just finished reading it. I have to say that the generalization in 3.2 went a bit over my head, but otherwise it's pretty amazing that they could define such a generalization.
The intuition for this type of problem is often to proceed one step at a time, N times. That it is provably doable in log2(N) steps is great, especially since it lets you choose the depth/number of processors you want to use for the problem. Hopefully the next time I design a latency-constrained system I'll remember to look at that article.
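For anyone who wants to see the log2(N) claim concretely, here is a minimal sketch of a prefix sum done in doubling rounds, in the style usually attributed to Hillis and Steele. It is a sequential Python simulation, not the paper's own formulation: the name `scan_log_steps` is made up for illustration, and each round stands in for what would be one fully parallel step with one processor per element.

```python
import operator

def scan_log_steps(values, op=operator.add):
    """Inclusive prefix scan in ceil(log2(N)) rounds (sequential simulation).

    `op` is assumed to be associative; that is what makes the regrouping valid.
    """
    result = list(values)
    n = len(result)
    rounds = 0
    offset = 1
    while offset < n:
        # On a parallel machine every index i would be updated at once.
        # Building a new list ensures all reads see the previous round's values.
        result = [
            op(result[i - offset], result[i]) if i >= offset else result[i]
            for i in range(n)
        ]
        offset *= 2
        rounds += 1
    return result, rounds

prefix_sums, rounds = scan_log_steps([1, 2, 3, 4, 5, 6, 7, 8])
print(prefix_sums, rounds)  # [1, 3, 6, 10, 15, 21, 28, 36] 3
```

Eight elements take three rounds instead of seven sequential additions. The price is more total work (roughly N*log2(N) operations instead of N), which is the depth-versus-processors trade-off in a nutshell.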
|
Nah. Your next step is to read "Data Parallel Algorithms" by Hillis and Steele, which starts to show how these principles can be applied to code. (It's a much higher-level, easier-to-follow paper. It's from ACM too, so it's free since it's older than 2000.)
Then you realize that all you're doing is following the steps towards "Map-reduce" and modern parallel code, and you just use MapReduce / NVIDIA's CUB scan primitives (cub::DeviceScan and friends) / etc., all the modern stuff that is built from these fundamental concepts.
Kogge and Stone's paper sits at the root of it all though.
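To make the link between recurrences and scans concrete, here is a rough sketch, assuming the simplest first-order case x_i = a_i * x_(i-1) + b_i. Each step is an affine map x -> a*x + b, and composing two affine maps gives another affine map, so the whole recurrence becomes a scan over (a, b) pairs with an associative combine. The names `combine` and `solve_recurrence` are made up for illustration, and `itertools.accumulate` is only a sequential stand-in for whatever parallel scan you would actually use.

```python
from itertools import accumulate

def combine(left, right):
    """Compose two affine maps: apply `left` first, then `right`.

    right(left(x)) = a2*(a1*x + b1) + b2 = (a2*a1)*x + (a2*b1 + b2)
    This operation is associative, which is what a parallel scan needs.
    """
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def solve_recurrence(a, b, x0):
    """Return [x_1, ..., x_N] for x_i = a[i-1] * x_(i-1) + b[i-1]."""
    # Prefix compositions: the i-th entry maps x0 directly to x_i.
    prefixes = accumulate(zip(a, b), combine)
    return [A * x0 + B for A, B in prefixes]

print(solve_recurrence([2, 3, 1], [1, 0, 4], x0=1))  # [3, 9, 13]
```

Since `combine` is associative, swapping `accumulate` for a log-depth scan should give the same answers (exactly for integers, up to rounding for floats), just in fewer dependent steps.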