|
|
|
|
|
by brrrrrm
808 days ago
|
|
using AVX/FMA and unrolling loops does extremely little in the way of compiling to fast (>80% peak) GEMM code. These are very much intro steps that don't take into account many important ideas related to cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for real high performance GEMMs. |
|
Anyway, I have come to eat crow. Thank you for your insight and helping me to get a much better perspective on this problem. I mostly work with scalar and vector updates, and do not work with matrices very often.