
CUTLASS, NVIDIA's C++ template library for writing matrix multiply and convolution kernels parametrized over input/output types, operators, and algorithm block sizes, theoretically supports this. But in an algorithm that computes one (block dimension, block dimension) submatrix of the output matrix D per thread block, each element of the (k, n) shaped matrix B will be read from global memory ceil(m / block dimension) times, where m is the number of rows of D. It will probably be more efficient to cast your B matrix to a lower precision such as FP16 or INT8 in a preprocessing kernel, which reduces memory traffic in the matrix multiply kernel.
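To put rough numbers on that trade-off, here is a back-of-envelope sketch in Python. It assumes D = A·B with A of shape (m, k) and B of shape (k, n), square output tiles, and no cache reuse between thread blocks, so each element of B is reloaded once per row of output tiles. The function name and the sizes are made up for illustration; this is a traffic model, not a measurement.

```python
import math

def b_traffic_bytes(m, k, n, tile, bytes_per_elem):
    """Bytes of B read from global memory, assuming no reuse across thread blocks."""
    # Every thread block in the same column of output tiles reloads the same
    # (k, tile) slice of B, so B is read once per row of tiles.
    row_tiles = math.ceil(m / tile)
    return row_tiles * k * n * bytes_per_elem

# Hypothetical problem size and tile size, chosen only for illustration.
m = k = n = 4096
tile = 128

fp32 = b_traffic_bytes(m, k, n, tile, 4)
fp16 = b_traffic_bytes(m, k, n, tile, 2)

# One-time cost of the preprocessing cast: read B once as FP32, write it once as FP16.
cast_cost = k * n * (4 + 2)

print(f"B traffic, FP32 in the GEMM:      {fp32 / 2**30:.2f} GiB")
print(f"B traffic, cast to FP16 + GEMM:   {(fp16 + cast_cost) / 2**30:.2f} GiB")
```

Under these assumptions the cast pays for itself as soon as there are more than a few rows of output tiles, because the one-time pass over B is small next to the repeated reads.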

On newer GPUs, though, we have this huge L2 cache, which makes the calculus a little different if your working set fits into it. For example, the Ampere A100 has 40 MB of L2 cache.
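A quick way to sanity-check the "fits into it" condition is to compare the footprint of B against the L2 size. The sketch below is Python with a hypothetical helper name, treats 40 MB as 40 MiB, and only counts B itself. The real working set also includes tiles of A and D, so read it as an optimistic bound.

```python
# Assumption: 40 MB is treated as 40 MiB.
L2_BYTES_A100 = 40 * 1024**2

def b_fits_in_l2(k, n, bytes_per_elem, l2_bytes=L2_BYTES_A100):
    """True if a (k, n) matrix B alone is no larger than the L2 cache."""
    return k * n * bytes_per_elem <= l2_bytes

# Hypothetical 4096 x 4096 B matrix: 64 MiB in FP32, 32 MiB in FP16.
print("FP32 B fits in L2:", b_fits_in_l2(4096, 4096, 4))
print("FP16 B fits in L2:", b_fits_in_l2(4096, 4096, 2))
```

For this particular size, casting to FP16 is also what brings B under the L2 capacity, so the two effects can compound.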