Hacker News new | ask | show | jobs
by meisanother 1369 days ago
There's also something beautiful about seeing or creating a Kogge-Stone implementation on silicon.

I know it was one of the first time I thought to myself: this is not just a straightforward pipeline, yet it all follows such a beautifully geometrical interconnect pattern. Super fast, yet very elegant to layout.

1 comments

The original paper is a masterpiece to read as well, if you haven't read it.

"A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equation", by Kogge and Stone.

It proves the result for _all_ associative operations (technically, a class slightly larger than associative. Kogge and Stone called this a "semi-associative" operation).

Well, just got it. Thanks for the reference!

A bit sad that 1974 papers are still behind a IEEE paywall...

Edit: Just finished reading it. I have to say that the generalization of 3.2 got a bit over me, but otherwise it's pretty amazing that they could define such a generalization. Intuition for those type of problem is often to proceed one step at a time, N times.

That it is provably doable in log2(N) is great, especially since it allows for a choice of the depth/number of processors you want to use for the problem. Hopefully next time I design a latency-constrained system I remember to look at that article

> Hopefully next time I design a latency-constrained system I remember to look at that article

Nah. Your next step is to read "Data parallel algorithms" by Hillis and Steele, which starts to show how these principles can be applied to code. (Much higher-level, easier to follow, paper. From ACM too, so its free since its older than 2000)

Then you realize that all you're doing is following the steps towards "Map-reduce" and modern parallel code and just use Map Reduce / NVidia cub::scan / etc. etc. and all the modern stuff that is built from these fundamental concepts.

Kogge and Stone's paper sits at the root of it all though.

From Data Parallel algorithms by Hillis and Steele conclusion: 'if the number of lines of code is fixed and the amount of data is allowed to grow arbitrarily, then the ratio of code to data will necessarily approach zero. The parallelism to be gained by concurrently operating on multiple data elements will therefore be greater than the parallelism to be gained by concurrently executing lines of code'

I feel like this sums up the way of thinking one must have in this paradigm.

It's funny because when designing digital hardware, you're kind-of trained to see things under that angle, since often the commands you write will expand in that tree-like structure, but to gates/ALU instead of data.

Then managing the data-path often piggy-backs on the hardware structure you just generated.

I feel like I have a overall grasp on these concept, yet there's so much interesting results that I don't know about...