by adgjlsfhk1
1020 days ago
My favorite example of big LMUL is matmul. You can do an entire GEMM microkernel in about 8 instructions with LMUL=4 by using a 7x4 kernel. You have 32 vector registers that turn into 8 register groups with LMUL=4: 7 of them end up storing your C values, 1 stores your A values, and you put the B values in scalar registers. Thus your entire kernel ends up being one 4x-wide vector load and seven 4x-wide FMA instructions.
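
To make the register budget concrete, here is a rough portable C model of that kernel shape. It is only a sketch under my own assumptions, not real RVV code: the function name, the argument layout, and the `MAX_VL` cap are all made up for illustration. Each inner `j` loop stands in for a single vector instruction operating on one LMUL=4 register group (`vle32.v` for the load, `vfmacc.vf` for each scalar-times-vector FMA).

```c
#include <assert.h>
#include <stddef.h>

/* MR = 7 accumulator groups, as in the comment above.
 * MAX_VL is an arbitrary cap chosen for this model only. */
enum { MR = 7, MAX_VL = 64 };

/* Hypothetical helper: C[i][j] += sum over k of b_scal[i][k] * a_vec[k][j]
 * for i in 0..6 and j in 0..vl-1.
 * a_vec  : K rows of vl floats (the vector-loaded operand), row stride lda
 * b_scal : MR rows of K floats (the operand kept in scalar registers), row stride ldb
 * c      : MR rows of vl floats, row stride ldc */
void gemm_microkernel_7xvl(size_t K, size_t vl,
                           const float *a_vec, size_t lda,
                           const float *b_scal, size_t ldb,
                           float *c, size_t ldc)
{
    assert(vl <= MAX_VL);

    float acc[MR][MAX_VL]; /* models the 7 register groups holding C */
    float a[MAX_VL];       /* models the 8th register group holding A */

    /* Load the C tile into the accumulators. */
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < vl; j++)
            acc[i][j] = c[i * ldc + j];

    for (size_t k = 0; k < K; k++) {
        /* One vector load (would be a single vle32.v with LMUL=4). */
        for (size_t j = 0; j < vl; j++)
            a[j] = a_vec[k * lda + j];

        /* Seven FMAs (each would be a single vfmacc.vf with LMUL=4). */
        for (size_t i = 0; i < MR; i++) {
            float s = b_scal[i * ldb + k]; /* scalar register operand */
            for (size_t j = 0; j < vl; j++)
                acc[i][j] += s * a[j];
        }
    }

    /* Store the accumulators back to C. */
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < vl; j++)
            c[i * ldc + j] = acc[i][j];
}
```

Per `k` step the body is 1 load plus 7 FMAs, which is where the 8 comes from. The scalar loads for B and the loop bookkeeping are extra and not counted in that figure.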