by Const-me
590 days ago
A couple of years ago, I wanted about the same thing in HLSL, for a Direct3D 11.0 compute shader. Here’s the fastest version I managed to make back then: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral... As you can see, I implemented 32×32 tiling using thread groups of 32×8 threads, with two groupshared buffers to load tiles of the input matrices. Each thread accumulates into local variables: 32 / 8 = 4 accumulators per thread.
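For anyone who would rather not read the shader, here is a rough CPU-side sketch of that tiling scheme in Python. It is not the HLSL from the linked repository, only an illustration under stated assumptions: the names (`tiled_matmul`, `tile_a`, `tile_b`, `acc`) are made up, the row assignment (thread `ty` handles rows `ty`, `ty + 8`, `ty + 16`, `ty + 24` of the tile) is a guess at one reasonable layout, and all matrix dimensions are assumed to be multiples of 32.

```python
TILE = 32                  # each thread group produces one 32x32 output tile
GROUP_X, GROUP_Y = 32, 8   # thread group shape: 32x8 threads
ACCS = TILE // GROUP_Y     # 32 / 8 = 4 accumulators per thread


def tiled_matmul(a, b):
    """Multiply a (M x K) by b (K x N), both given as lists of lists.

    Emulates the thread-group tiling sequentially; on a GPU the loops over
    (ty, tx) run in parallel and the tile loads are followed by a barrier.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    c = [[0.0] * n for _ in range(m)]

    for row0 in range(0, m, TILE):        # one "thread group" per output tile
        for col0 in range(0, n, TILE):
            # Per-thread local accumulators: acc[ty][tx][i]
            acc = [[[0.0] * ACCS for _ in range(GROUP_X)]
                   for _ in range(GROUP_Y)]

            for k0 in range(0, k, TILE):
                # Stand-ins for the two groupshared buffers (assumed layout)
                tile_a = [a[row0 + r][k0:k0 + TILE] for r in range(TILE)]
                tile_b = [b[k0 + r][col0:col0 + TILE] for r in range(TILE)]

                for ty in range(GROUP_Y):
                    for tx in range(GROUP_X):
                        for i in range(ACCS):
                            r = ty + i * GROUP_Y   # assumed row assignment
                            s = 0.0
                            for kk in range(TILE):
                                s += tile_a[r][kk] * tile_b[kk][tx]
                            acc[ty][tx][i] += s

            # Each thread writes its 4 results to the output tile
            for ty in range(GROUP_Y):
                for tx in range(GROUP_X):
                    for i in range(ACCS):
                        c[row0 + ty + i * GROUP_Y][col0 + tx] = acc[ty][tx][i]
    return c
```

The point of the layout is that every element loaded into the shared tiles is reused 32 times within the group, and keeping 4 accumulators per thread lets one 32×8 group cover the full 32×32 tile.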