|
Thanks. Amazing overview. abstract away what they
cannot know
Here I'm a bit surprised. For theoretical reasons (I've never gotten
my hands dirty with GPU implementations, I'm afraid to admit), I would
have expected that it is highly useful to have a language interface
that allows run-time (or at least compile-time) reflection on:- Number of levels (e.g. how many layers of caches). - Size of levels. - Cost of memory access at each level. - Cost of moving data from a level to the next level up/down. The last 3 numbers don't have to be absolute, I imagine, but can be
relative, e.g.: size(Level3) = 32 * size(Level2). This data would be useful to
decide how to partition compute jobs as you describe, and do so in a
way that is (somewhat) portable.
There are all manner of subtle issues, e.g. what counts as cost of memory access and data movements (average, worst case, single byte, DMA ...), and what the compiler and/or runtime should do (if anything) if they are are violated. In abstract terms: what is the semantics of the language representation of the memory hierarchy.
Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array, and have DMA-like send/receives to move blocks of data between levels. Is that a good idea? Equally subtle, and I already alluded to this above, is when to make this information available. Since the processor doesn't change during computing, I imagine that using a multi-stage meta-programming setup (see e.g. [1] for a rationale), might be the right framework: you have a meta-program, specialising the program doing the compute you are interested in. C++ use template for program specialisation, but
C++'s interface for meta-programming is not easy to use. It's possible to do much better. As you wrote above, programming
"in those languages is hard and error prone", and the purpose of language primitives is to catch errors early.
What errors would a compiler / typing system for such a language catch, ideally without impeding performance? [1] Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf |
That's an interesting thought. I'm not sure I agree, maybe?
The user job is to express how to partition the problem. To do that properly, they need to know, e.g., "how many bytes can I fit in the next partition per unit-of-compute", so the langue/run-time has to tell them that.
I don't know if knowing the costs of memory access at each level or across levels is useful. It is a reasonable assumption that, at the next level, memory accesses are at least one order of magnitude faster, and that synchronizing across the hierarchy is very expensive and should be minimized. That's a good assumption to write your programs with, and knowing actual costs do not really help you there.
> Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array
I think CUDA's __shared__ memory is quite good. E.g. per kernel, a way to obtain memory in the level of the hierarchy that the kernel is currently working on. Nobody has extended CUDA to multiple levels because there hasn't been a need, but I expect these programs to be "recursive" in the sense that they recursively partition a problem, and for __shared__ memory on each recursion level to give you memory at a different level of the hierarchy.
To move memory across levels of the hierarchy, you would just use raw pointers, with appropriate reads/writes (e.g. atomic ones). The exact instructions that get emitted would then depend on which level of the hierarchy you are. For that the compiler needs to know something about the architecture it is compiling for, but that seems unavoidable.
> [1] Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf
Thanks, I actually hadn't read the Terra paper. I'll try to get to it this weekend. I think that using a Lua-like language as the meta-programming for your language (e.g. Regent) is an interesting approach, but it is not necessary.
For example, Scheme, Lisp, Haskell, D and Rust have shown that you can do meta-programming quite well in the same language you do normal programming in, without having to learn a completely different "meta-language".
I particularly like Rust procedural macros. It is normal run-time Rust code that just get first compiled, and then executed (with your source code as input) during compilation, with some twist to make that quite fast (e.g. the compiler parses an AST, and the proc macros do AST->AST folds).
If one wants to delay that until run-time, one should just do so (e.g. you just embed your compiler in your app, and when your proc macros run, its "run-time"). No need to layer multiple languages, but maybe the Terra paper convices me otherwise.