by fluffything 2253 days ago
Thanks for the link to the HN discussion.

The Regent language (Legion is Regent's runtime) is yet another async, task-graph-based PGAS language/runtime, similar to, e.g., HPX, but it leaves out what in my opinion made Sequoia interesting (e.g. the memory hierarchy abstraction capabilities).


As far as I understand, the idea in Regent is that the memory hierarchy is used dynamically, since the programmer cannot know much about the memory hierarchy at program-writing time anyway. So dynamic scheduling is used to construct the memory hierarchy and schedule at run-time. The typical use case for Regent is physicists writing e.g. weather / particle physics simulations; they are not aware of the details of L1/2/3/Memory / network/cluster ... sizes.

This is probably quite different from your use case.

Those applications do linear algebra, and pretty much any kind of linear algebra on the GPU, including simple vector-vector dot-products, requires a very careful usage of the memory hierarchy.

For doing a dot-product on the GPU, you need to take your vectors in global memory, and:

- split them into chunks that will be processed by thread blocks

- allocate shared memory for storing partial reductions from warps within a thread-block

- decide how many elements a thread operates on, and allocate enough registers within a thread

- do thread-block reductions on shared memory

- communicate thread-block reduction to global memory

- do a final inter-thread block reduction
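
To make the data flow of the steps above concrete, here is a plain-Python model of it. This is not GPU code, only a sketch of which level of the hierarchy each value lives in. The function name `blocked_dot` and the parameters `block_size` and `items_per_thread` are made up for illustration, and the loops run sequentially where a GPU would run threads and blocks in parallel.

```python
def blocked_dot(x, y, block_size=4, items_per_thread=2):
    """Model the GPU data flow for a dot product on the CPU (illustrative only)."""
    assert len(x) == len(y)
    # The tree reduction below assumes a power-of-two block size.
    assert block_size > 0 and block_size & (block_size - 1) == 0

    chunk = block_size * items_per_thread   # elements handled by one thread block
    block_partials = []                     # stands in for global memory

    for start in range(0, len(x), chunk):   # one iteration per thread block
        shared = [0.0] * block_size         # stands in for shared memory
        for t in range(block_size):         # one iteration per thread
            acc = 0.0                       # stands in for a register
            for k in range(items_per_thread):
                i = start + t * items_per_thread + k
                if i < len(x):              # guard the ragged last chunk
                    acc += x[i] * y[i]
            shared[t] = acc

        # Thread-block reduction in "shared memory": halve the stride each step.
        stride = block_size // 2
        while stride > 0:
            for t in range(stride):
                shared[t] += shared[t + stride]
            stride //= 2

        block_partials.append(shared[0])    # write the block result to "global memory"

    # Final reduction across thread blocks.
    return sum(block_partials)
```

Even in this toy form, the chunk size, the shared buffer size, and the per-thread element count are all explicit choices, which is the part a language has to let you express.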

A sufficiently-smart compiler can take a:

   sum = 0.
   for a, b in zip(x, y): sum += a * b
and transform it into an algorithm that does the above (e.g. a call to CUB).

But if the language does not let you implement the above yourself, the user only needs to run into that need once (e.g. your compiler does not optimize their slightly different reduction efficiently) for the value proposition of your language to take a huge hit (if I need to learn CUDA anyway, I might as well use CUDA from the start; if I don't need performance, I wouldn't be using a super-expensive GPU, etc.).

This is IMO why CUDA is successful and pretty much all other languages are not, maybe with the exception of OpenACC, which has great CUDA interop (so you can start with OpenACC and use CUDA inline for a single kernel if you need to).

The 6 requirements you list for doing a dot-product on the GPU can be phrased in the abstract as a constraint-solving problem, where the number of thread blocks, the cost of communication, etc. are parameters.
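
A minimal sketch of what that formulation could look like, assuming a brute-force search instead of a real constraint solver. Everything here is hypothetical: the function `pick_config`, the resource limits, the candidate block sizes, and the cost weights are placeholders, not properties of any actual GPU or of Regent/Legion.

```python
from itertools import product
from math import ceil, log2

def pick_config(n, max_threads=1024, shared_bytes=48 * 1024,
                max_items=16, comm_cost=50.0, bytes_per_elem=4):
    """Brute-force search over launch configurations for an n-element reduction.

    All limits and cost weights are made-up placeholders for illustration.
    """
    best = None
    block_sizes = [32, 64, 128, 256, 512, 1024]
    for block_size, items in product(block_sizes, range(1, max_items + 1)):
        # Constraints: the block must fit the thread and shared-memory limits
        # (one partial result per thread is kept in shared memory).
        if block_size > max_threads:
            continue
        if block_size * bytes_per_elem > shared_bytes:
            continue

        num_blocks = ceil(n / (block_size * items))

        # Toy cost model: serial work per thread, tree-reduction depth per
        # block, and one global-memory write per block.
        cost = items + log2(block_size) + comm_cost * num_blocks

        if best is None or cost < best["cost"]:
            best = {"block_size": block_size, "items_per_thread": items,
                    "num_blocks": num_blocks, "cost": cost}
    return best
```

For example, `pick_config(1_000_000)` returns the cheapest configuration under this toy model; a realistic version would need measured costs and would also have to account for how many blocks can run concurrently.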