| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by meisanother 1369 days ago

Well, just got it. Thanks for the reference!

A bit sad that 1974 papers are still behind a IEEE paywall...

Edit: Just finished reading it. I have to say that the generalization of 3.2 got a bit over me, but otherwise it's pretty amazing that they could define such a generalization. Intuition for those type of problem is often to proceed one step at a time, N times.

That it is provably doable in log2(N) is great, especially since it allows for a choice of the depth/number of processors you want to use for the problem. Hopefully next time I design a latency-constrained system I remember to look at that article

1 comments

dragontamer 1369 days ago

> Hopefully next time I design a latency-constrained system I remember to look at that article

Nah. Your next step is to read "Data parallel algorithms" by Hillis and Steele, which starts to show how these principles can be applied to code. (Much higher-level, easier to follow, paper. From ACM too, so its free since its older than 2000)

Then you realize that all you're doing is following the steps towards "Map-reduce" and modern parallel code and just use Map Reduce / NVidia cub::scan / etc. etc. and all the modern stuff that is built from these fundamental concepts.

Kogge and Stone's paper sits at the root of it all though.

meisanother 1369 days ago

From Data Parallel algorithms by Hillis and Steele conclusion: 'if the number of lines of code is fixed and the amount of data is allowed to grow arbitrarily, then the ratio of code to data will necessarily approach zero. The parallelism to be gained by concurrently operating on multiple data elements will therefore be greater than the parallelism to be gained by concurrently executing lines of code'

I feel like this sums up the way of thinking one must have in this paradigm.

It's funny because when designing digital hardware, you're kind-of trained to see things under that angle, since often the commands you write will expand in that tree-like structure, but to gates/ALU instead of data.

Then managing the data-path often piggy-backs on the hardware structure you just generated.

I feel like I have a overall grasp on these concept, yet there's so much interesting results that I don't know about...