|
For programs running on the CPU, the hardware gives the programmer no way to manually control memory transfers across the memory hierarchy, so exposing the hierarchy in your programming language does not solve any problems. So I'd say most languages don't expose it because there is no need for it. Many languages for GPUs do expose the memory hierarchy to programmers (e.g. CUDA, OpenCL, even OpenACC).

> What is the core problem that has not yet been solved?

That using the memory hierarchy efficiently and correctly when writing GPU programs in those languages is hard and error-prone. It is trivial to write CUDA code that performs horribly and/or has data races and other forms of UB, e.g., when dealing with global/shared/local memory.

Sequoia attempted to split a kernel into the algorithms that run at the different levels of the memory hierarchy and the memory the kernel owns at each level, and to describe how that memory is split as you "refine" the kernel, e.g., from kernel (global) -> thread blocks (global, shared, constant) -> warps (global, shared, constant, shared registers) -> threads (global, shared, constant, shared registers, local memory).

For many algorithms (e.g. GEMM, convolutions), how you partition global memory across thread blocks, and which parts of global memory you load into shared memory and how, has a huge impact on performance. |
- Programmers being unable to control caches, at least not directly and reliably.
- Languages (e.g. C/C++) having no direct way of expressing memory constraints.
This suggests to me that even in CPU programming there is something important missing, and I imagine that a suitable explicit representation of the memory hierarchy might be it. A core problem is that it's unclear how to abstract a program so it remains performant across different memory hierarchies.
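
To make the abstraction problem a bit more concrete, here is a minimal sketch of a cache-blocked matrix multiply in C++. The names `gemm_naive`, `gemm_tiled` and the `Tile` parameter are made up for illustration and do not come from Sequoia or any library. The point is that the algorithm is the same at every tile size, and `Tile` is the only knob tied to the memory hierarchy. On a CPU it is only a hint: the code can shape its access pattern so that a tile probably stays in cache, but it cannot pin it there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: C += A * B for row-major n x n matrices.
void gemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}

// Blocked version: same arithmetic, but the iteration space is cut into
// Tile x Tile blocks so that the working set of the inner loops is small.
// Tile is an assumption about the target (e.g. picked so that a few tiles
// fit in L1 or L2); nothing in the language checks or enforces that.
template <std::size_t Tile>
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t n) {
    static_assert(Tile > 0, "tile size must be positive");
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i0 = 0; i0 < n; i0 += Tile)
        for (std::size_t k0 = 0; k0 < n; k0 += Tile)
            for (std::size_t j0 = 0; j0 < n; j0 += Tile) {
                // Clamp the block so n need not be a multiple of Tile.
                const std::size_t i1 = std::min(i0 + Tile, n);
                const std::size_t k1 = std::min(k0 + Tile, n);
                const std::size_t j1 = std::min(j0 + Tile, n);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Calling `gemm_tiled<8>(A, B, C, n)` and `gemm_tiled<64>(A, B, C, n)` computes the same result, but which one is faster depends on the cache sizes of the machine it runs on, and that choice has to be re-tuned per target. That re-tuning is the part a portable abstraction of the memory hierarchy would have to take over.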