by llwu
938 days ago
Question on the "Batching memory-bound processes on a GPU" section - it says "This enables us to reuse parts of the model that we’ve already loaded into the GPU’s SRAM", but the 10 GB we are loading is into the HBM, right? How did we overcome the HBM <-> SRAM bottleneck? More generally, how can we find out the size of the SRAM?
You can estimate the total SRAM as follows: an A100 has 108 SMs, and each SM has 192 KB of SRAM (a combined shared memory / L1 cache) [1]. Multiplied out, this is ~20 MB of total SRAM, which matches the diagram in the Flash Attention paper [2].
[1] https://developer.nvidia.com/blog/cuda-refresher-cuda-progra...
[2] https://arxiv.org/pdf/2205.14135.pdf
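
For anyone who wants to redo the multiplication for a different GPU, here is a quick Python sketch of the same arithmetic. The SM count and per-SM figure are the A100 numbers quoted above; the function name `total_sram_mb` is just made up for illustration, and it assumes 1 MB = 1024 KB.

```python
# Figures quoted above for an NVIDIA A100.
NUM_SMS = 108          # streaming multiprocessors on the chip
SRAM_PER_SM_KB = 192   # combined shared memory / L1 per SM, in KB


def total_sram_mb(num_sms: int, sram_per_sm_kb: int) -> float:
    """Total on-chip SRAM across all SMs, in MB (assuming 1 MB = 1024 KB)."""
    return num_sms * sram_per_sm_kb / 1024


a100_sram_mb = total_sram_mb(NUM_SMS, SRAM_PER_SM_KB)
print(f"{a100_sram_mb:.2f} MB")  # 20.25 MB
```

Swap in the SM count and per-SM size for another card to get its rough total.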