by gnufx 1722 days ago
Debian and Fedora provide serial, OpenMP, and pthreads versions of libopenblas. Are you sure OpenBLAS doesn't detect nested OpenMP? I thought it did, though I'd normally use the serial version outside something like R, but if you mix different low-level simple pthreads with high-level OpenMP, you can expect problems.

OpenBLAS is fine generally -- competitive with MKL on Intel hardware and infinitely faster on ARM and POWER. For PyTorch, presumably you want libxsmm (which is responsible for MKL's current small matrix performance). On AMD hardware, I don't understand why people avoid AMD's support, which is just a version of BLIS and libflame. (BLIS' OpenMP story seems better than OpenBLAS'.)

The linear algebra story on GNU/Linux distributions would be less of a mess without proprietary libraries like MKL. It's fine if you take the Debian approach, in significant experience running heterogeneous HPC systems. Fedora has cocked up policy through not listening to such experience, but you can do the Debian-style thing with the approach of https://loveshack.fedorapeople.org/blas-subversion.html (and see the old R example refuting the MKL story). That's one example of the value of dynamic linking.
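
Since the BLAS implementation is chosen at dynamic-link time, it can help to check which one a process actually loaded. A minimal sketch, assuming Linux (it reads /proc/<pid>/maps); the helper names and the file-name hints are my own illustration, not part of any library:

```python
# Substrings that commonly appear in BLAS/LAPACK shared-library file names.
# This list is an assumption for illustration; extend it for your system.
BLAS_HINTS = ("openblas", "blis", "flame", "mkl", "blas", "lapack", "atlas")


def blas_libraries(maps_text):
    """Return sorted, de-duplicated paths of mapped shared objects that look like BLAS."""
    paths = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1].lower()
        if ".so" in name and any(hint in name for hint in BLAS_HINTS):
            paths.add(path)
    return sorted(paths)


def loaded_blas(pid="self"):
    """Read /proc/<pid>/maps (Linux only) and list the BLAS-like libraries mapped there."""
    with open(f"/proc/{pid}/maps") as f:
        return blas_libraries(f.read())
```

Calling loaded_blas() after importing whatever pulls in BLAS (R, NumPy, libtorch bindings) should show whether the serial, OpenMP, or pthreads flavour was picked up, without rebuilding anything.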

> On AMD hardware, I don't understand why people avoid AMD's support, which is just a version of BLIS and libflame.

A year ago, I benchmarked a transformer network with libtorch linked against various BLAS libraries (numbers are in sentences per second, MKL with CPU detection override on AMD, 4 threads):

Ryzen 3700X - OpenBLAS: 83, BLIS: 69, AMD BLIS: 80, MKL: 119

Xeon Gold 6138 - OpenBLAS: 88, BLIS: 52, AMD BLIS: 59, MKL: 128

I guess people avoid AMD's support because MKL is just much faster? AMD BLIS has added batch GEMM support since then, but I haven't had time to try it out yet.

I was thinking of the usual complaint about Intel not supporting AMD hardware that is common in HPC.

We don't know what that example was actually measuring, except apparently not the same thing for BLIS and MKL. On the basis of only that, it's not reasonable to say "just so much faster", in particular for what I care about. I have Zen2 measurements (unfortunately only in a VM) using the BLIS test/3 framework. MKL came out nearly as fast as vanilla BLIS 0.7 and OpenBLAS on serial DGEMM, less so on the rest of D level 3, and nowhere close with S, C, and Z. Similarly for one- and two-socket OpenMP. At least in that "2021" version of MKL, there's only a Zen DGEMM kernel.

Have you set the environment variable OPENBLAS_CORETYPE to specify the CPU?

I went further than that: I profiled with perf and checked that the right kernels were used.
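
For anyone wanting to try the OPENBLAS_CORETYPE route mentioned above, here is a rough sketch of launching a benchmark with the variable set. Assumptions: OpenBLAS was built with DYNAMIC_ARCH (otherwise the variable has no effect), "Zen" is one of the core names in that build, and bench.py is a hypothetical script standing in for your own benchmark. OPENBLAS_VERBOSE=2 asks OpenBLAS to report the selected core on stderr, which is a cheap sanity check but no substitute for profiling with perf.

```python
import os


def blas_env(coretype=None, threads=None, verbose=False, base=None):
    """Build an environment dict with OpenBLAS settings for a child process.

    The variables must be set before the library is loaded, so they are
    passed to a fresh process rather than changed after import.
    """
    env = dict(os.environ if base is None else base)
    if coretype is not None:
        env["OPENBLAS_CORETYPE"] = coretype          # force a kernel set
    if threads is not None:
        env["OPENBLAS_NUM_THREADS"] = str(threads)   # cap the thread count
    if verbose:
        env["OPENBLAS_VERBOSE"] = "2"                # report the chosen core on stderr
    return env


# Usage (bench.py is hypothetical):
#   import subprocess, sys
#   subprocess.run([sys.executable, "bench.py"],
#                  env=blas_env("Zen", threads=4, verbose=True), check=True)
```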