Hacker News new | ask | show | jobs
by eholk 4737 days ago
I really should update the user guide, since the language has changed a lot in the last year.

I have an earlier post that discusses a couple of the optimizations that Harlan does: http://blog.theincredibleholk.org/blog/2013/06/10/some-simpl...

To me the win for Harlan over CUDA is its region system, which lets you work with more intricate pointer structures in the GPU. For example, there is an interpreter for the Lambda Calculus as one of the test cases, which would be much harder to do in straight CUDA: https://github.com/eholk/harlan/blob/master/test/lambda3.kfc

Very soon, Harlan will have support for higher order procedures, which is also not available in CUDA.

2 comments

How do you support higher order procedures? Do you do control flow analysis & defunctionalization or are these "real" higher order functions?
It's basically defunctionalization. It's hard to get access to function pointers on the GPU to do "real" higher order functions.
What version of CUDA are you talking about. I though the latest version supported indirect function pointers (finally).
That's good news! Did that change with Kepler? Any links to more info?
I wish they'd at least give us a computed goto...but alas, every switch is destined for predicated branching.
Thanks! What kind of performance do you get with some of those more intricate pointer structures? My own research right now is focused on how to build some common abstract data types with less pointer indirection, so as to be more easily vectorizable.
Lately I've been more focused on getting the features running and less on performance, so I haven't completely explored the impact of using these structures.

I'd be interested to hear more about what you're doing. Do you have any links you could share?

I've actually only really just started, I'm less than a year into my PhD, so I don't have anything published yet. My current project involves doing some non-trivial calculations on a large matrix of multi-precision integers - the obvious way to lay out the data (an array of dynamically resizable vectors) is dead wrong for GPU (the memory bandwidth is completely swamped by fetching from a different pointer for each thread - I got 100x slowdown from the CPU reference implementation), so right now I'm redoing the experiments based on a vector of fixed-length arrays (the 0th elements of all the vectors from the first formulation go in one array, the 1st elements in the next, and so forth).
"Array of structs" -> "Struct of arrays"?

I guess if you generalize this transformation sufficiently you end up doing whole-program flattening like NESL and DPH. Or is there a less intrusive way to pull that off?

Sounds like in his case the unboxed vector soa rep might be enough, though I'd enjoy hearing more about this too. (Doing some gpu work with Haskell a bunch this fall)
I don't think there's a less intrusive way to pull it off, but with sufficient compiler support you should be able to make it fairly transparent to the programmer.