by microtonal
1714 days ago
On AMD hardware, I don't understand why people avoid AMD's support, which is just a version of BLIS and libflame.

A year ago, I benchmarked a transformer network with libtorch linked against various BLAS libraries (numbers are in sentences per second, 4 threads; MKL was run with the CPU detection override on AMD):

Ryzen 3700X - OpenBLAS: 83, BLIS: 69, AMD BLIS: 80, MKL: 119
Xeon Gold 6138 - OpenBLAS: 88, BLIS: 52, AMD BLIS: 59, MKL: 128

I guess people avoid AMD's support because MKL is just much faster?

AMD BLIS has added batch GEMM support since then. I haven't had time to try that out yet.
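To put the quoted figures in relative terms, here is a minimal sketch that divides the MKL throughput by each other library's throughput on the same CPU. The numbers are copied from the comment above; the `results` layout and the `advantage` helper are hypothetical names used only for illustration, and the ratios describe this one end-to-end workload, not BLAS performance in general.

```python
# Throughput figures quoted in the comment (sentences per second, 4 threads).
results = {
    "Ryzen 3700X": {"OpenBLAS": 83, "BLIS": 69, "AMD BLIS": 80, "MKL": 119},
    "Xeon Gold 6138": {"OpenBLAS": 88, "BLIS": 52, "AMD BLIS": 59, "MKL": 128},
}


def advantage(results, reference="MKL"):
    """Return {cpu: {library: reference_throughput / library_throughput}}.

    A value of 1.5 means the reference library processed 1.5x as many
    sentences per second as the other library on that CPU.
    """
    ratios = {}
    for cpu, row in results.items():
        ref = row[reference]
        ratios[cpu] = {
            lib: ref / value for lib, value in row.items() if lib != reference
        }
    return ratios


if __name__ == "__main__":
    for cpu, row in advantage(results).items():
        for lib, ratio in row.items():
            print(f"{cpu}: MKL / {lib} = {ratio:.2f}")
```

On these figures, MKL leads OpenBLAS by roughly 1.4x to 1.5x on both machines, while its lead over the two BLIS builds is larger on the Xeon than on the Ryzen.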
We don't know what that example was actually measuring, except apparently not the same thing for BLIS and MKL. On the basis of only that, it's not reasonable to say "just so much faster", in particular for what I care about. I have Zen2 measurements (unfortunately only in a VM) using the BLIS test/3 framework. MKL came out nearly as fast as vanilla BLIS 0.7 and OpenBLAS on serial DGEMM, less so on the rest of D level 3, and nowhere close with S, C, and Z. Similarly for one- and two-socket OpenMP. At least in that "2021" version of MKL, there's only a Zen DGEMM kernel.
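The distinction drawn here is between an end-to-end figure (sentences per second) and a kernel-level one (GEMM throughput). As a rough illustration of the latter, the sketch below times a single matrix multiply and converts it to GFLOP/s using the usual 2·m·n·k operation count. It is not the BLIS test/3 framework: `naive_dgemm` is a pure-Python stand-in for a real BLAS call, and all function names are hypothetical. To compare libraries, one would pass a wrapper around each library's DGEMM as `kernel`.

```python
import time


def gemm_flops(m, n, k):
    """Floating-point operations for an (m x k) by (k x n) multiply-accumulate."""
    return 2 * m * n * k


def naive_dgemm(a, b, m, n, k):
    """Reference triple-loop multiply on lists of lists (stand-in kernel)."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_a = a[i]
        row_c = c[i]
        for p in range(k):
            a_ip = row_a[p]
            row_b = b[p]
            for j in range(n):
                row_c[j] += a_ip * row_b[j]
    return c


def measure_gflops(kernel, m, n, k, repeats=3):
    """Run kernel several times and report GFLOP/s for the fastest run."""
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(a, b, m, n, k)
        best = min(best, time.perf_counter() - start)
    return gemm_flops(m, n, k) / best / 1e9


if __name__ == "__main__":
    size = 64
    rate = measure_gflops(naive_dgemm, size, size, size)
    print(f"naive {size}x{size}x{size} multiply: {rate:.4f} GFLOP/s")
```

Taking the best of several runs reduces noise from warm-up and scheduling. A measurement like this isolates one routine at one precision, which is why it can rank libraries differently from a whole-application benchmark.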