| HN Mirror

Those applications do linear algebra, and pretty much any kind of linear algebra on the GPU, including simple vector-vector dot-products, requires a very careful usage of the memory hierarchy.

For doing a dot-product on the GPU, you need to take your vectors in global memory, and:

- split them into chunks that will be processed by thread blocks

- allocate shared memory for storing partial reductions from warps within a thread-block

- decide how many elements a thread operates on, and allocate enough registers within a thread

- do thread-block reductions on shared memory

- communicate thread-block reduction to global memory

- do a final inter-thread block reduction

A sufficiently-smart compiler can take a:

   sum = 0.
   for x,y in zip(x,y): sum += x+y

and transform it into an algorithm that does the above (e.g. a call to CUB).

But if you are not able to implement the above in the language, it suffices for the user to run into the need to do so once (e.g. your compiler does not optimize their slightly different reduction efficiently), for the value proposition of your language to suffer a huge hit (if I need to learn CUDA anyways, I might just use CUDA from the start; if I don't need performance, I wouldn't be using a super-expensive GPU, etc.).

This is IMO why CUDA is successful, and pretty much all other languages are not, maybe with the exception of OpenACC which has great CUDA interop (so you can start with OpenACC, and use cuda inline for a single kernel, if you need to).