by gnufx
2956 days ago
It's a factor of three for the large-matrix serial case on KNL (see the OpenBLAS issue on KNL), whereas you might expect a factor of two by analogy with avx/avx2. For large BLAS on avx512 (and maybe other x86_64, which is now dynamically dispatched), use BLIS. BLIS also provides a non-BLAS interface. For small matrix multiplication, use libxsmm, of course. Remember that the world isn't all amd64/x86_64; elsewhere BLIS is infinitely faster than MKL, and it's probably faster even on Bulldozer/Zen. (I haven't compared on Bulldozer recently, and don't have Zen.)
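A rough sketch of where the expected factor of two comes from, assuming the only difference between the two code paths is vector width (256-bit avx2 vs 512-bit avx512) with FMA available on both. The function name and its parameters are made up for illustration, and it ignores clock speed, memory bandwidth and the number of vector units, so it says nothing about the measured factor of three.

```python
def peak_dp_flops_per_cycle(vector_bits, fma=True):
    """Back-of-envelope peak double-precision flops per cycle for one vector unit.

    Hypothetical helper: assumes one vector instruction issued per cycle.
    """
    lanes = vector_bits // 64           # 64-bit doubles per vector register
    flops_per_lane = 2 if fma else 1    # an FMA counts as a multiply plus an add
    return lanes * flops_per_lane


avx2 = peak_dp_flops_per_cycle(256)     # 4 lanes * 2 flops
avx512 = peak_dp_flops_per_cycle(512)   # 8 lanes * 2 flops
expected_ratio = avx512 / avx2          # width doubles, so peak doubles

print(avx2, avx512, expected_ratio)
```

On that naive model an avx2-only kernel should be about half as fast as an avx512 one, which is why a factor of three looks like more than just the narrower vectors.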
Thanks for the heads-up re: BLIS, I'd forgotten about it; it's probably the best option, especially considering its open source status.
[0] https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-3...