|
|
|
|
|
by 9q9
2251 days ago
|
|
Interestingly, a lot of programmers (who program for the CPU) worry
about caches a great deal, despite: - Programmers being unable to control caches, at least directly, and
reliably. - Languages (e.g. C/C++) having no direct way of expressing memory
constraints. This suggests to me that even in CPU programming there is something
important missing, and I imagine that a suitable explict
representation of the memory hierarchy might be it. A core problem is
that its unclear how to abstract a program so it remains perfomant
over different memory hierarchies. |
|
On different CPUs (e.g. with different cache sizes), these loops need to be tiled differently. If you want a single binary to perform well across the board, it just need to support using the different tile-sizes depending on the hardware.
Usually, this is however not enough. For example, you have some data in memory (a vector of f16s on an intel CPU), and for operating on them, you need to decompress that first into a vector of f32s.
You probably want to decompress to fill the cache, operate, recompress, to save memory and memory bandwidth. For that you need a "scratchpad" (or __shared__ memory in CUDA), e.g., for the different levels of the cache hierarchy.
The compiler needs to know, for the different architectures, what their L1/L2/L3/shared/constant/texture/.... memory sizes are, and either fill these sizes for you to use the whole cache, or let the user pick them, making the program fail if run on hardware where this isn't correct. NVCC (and pretty much every production compiler) can bundle code for different architectures within a single binary, and pick the best at run-time.
So if your L3 can be 4,8,16, or 32 Mb, your compiler can bundle 4 copies of your kernel, query the CPU cache size at initialization, and be done with it.
The way in which you abstract the memory hierarchy is by just partitioning your problem's memory.
If you have a f16s vector in global memory, you might want to partition it into N different chunks, e.g., recursively, until if a chunk where to be decompressed into f32s, that decompressed chunk would fit into a faster memory (e.g. L3). At that point you might want to do the decompression, and continue partitioning the f32s up to some threshold (e.g. the L1), on which you operate.
That is, a kernel has multiple pieces:
- a state initialization (e.g. owns the global memory)
- partitioning: how to partition that into smaller subproblems
- the merge: what gets computed at each level of the partitioning (e.g. how are the results of the partitions merged together to compute the final result)
- the leaf computation: what gets computed at the "finest" level
The programmer doesn't know a priori how many partitions would be necessary, that would depend on the memory hierarchy. But it does know everything else.
For example, for a sum reduction:
- the programmer knows how to perform a leaf computation: by summing a whole leaf partition and putting the result somewhere in the next larger memory level.
- the programmer knows how to merge computations: by summing the results of the current level, and putting them in the next level of the hierarchy (inter-warp -> inter-block -> inter-grid, or L1 -> L2 -> L3 -> RAM) - the programmer knows how to partition an input vector into N pieces (N CUDA threads, N warps, N blocks, N grids, N cpu threads, N numa domains, etc.)A good programming model needs to allow the user to express what they know, and abstract away what they cannot know (how big the caches are, how many cache levels, how many threads of execution, etc.)