by kettleballroll
1714 days ago
In a previous life, almost a decade ago, I fought very similar fights with OpenMP and MKL using R. It's painful, and you need to pay heed to all the small details pointed out in the docs, as in OP's case. However, it's worth noting that OpenBLAS is as fast as MKL, at least if you compile it yourself for your system (I would expect system-provided ones with system detection to be as good, but that wasn't always the case back then). I benchmarked this extensively for all my R use cases and for several systems that I cared about back then. So there is usually no need to use MKL in the first place.
OpenBLAS is incompatible with application threads. Most Linux distributions provide a multi-threaded OpenBLAS that burns in a fire if you use it in multi-threaded applications. Even though OpenBLAS' performance is great, I'd be careful to give a general recommendation for people to rely on OpenBLAS. Like this MKL example, you have to be aware of its threading issues, read the documentation and compile it with the right flags (in a multi-threaded application: single-threaded, but with locking).
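To make the threading point concrete, here is a minimal sketch, assuming you cannot rebuild the library and just want the BLAS to stay single-threaded inside an application that already runs its own threads. The helper name `pin_blas_threads` is made up for illustration. The environment variables themselves are the standard ones read by OpenBLAS, OpenMP runtimes, and MKL, and they need to be set before the library is loaded. Note that this only limits the thread count; it is not a substitute for an OpenBLAS build compiled with locking.

```python
import os

# Thread-count variables read at load time by OpenBLAS, OpenMP runtimes, and MKL.
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")


def pin_blas_threads(n=1, env=None):
    """Set the BLAS thread-count variables to n in env (default: os.environ).

    Hypothetical helper for illustration. Call it before importing
    NumPy, PyTorch, or anything else that loads a BLAS library.
    """
    env = os.environ if env is None else env
    for var in BLAS_THREAD_VARS:
        env[var] = str(n)
    # Return the values that were set, which is handy for logging.
    return {var: env[var] for var in BLAS_THREAD_VARS}


if __name__ == "__main__":
    print(pin_blas_threads(1))
    # Import the BLAS-backed library only after this point, e.g.:
    # import numpy as np
```

Setting the variables after the library has already been imported generally has no effect, which is why the call sits at the very top of the program.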
> it's worth noting that OpenBLAS is as fast as MKL
This depends heavily on the application. For example, MKL provides batch GEMM, which is used by libraries like PyTorch. So if you use PyTorch for machine learning, performance is still much better with MKL. Of course, that only holds if you do not have an AMD CPU. If you have an AMD CPU, you have to override Intel CPU detection if you do not want abysmal performance:
https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
https://www.agner.org/optimize/blog/read.php?i=49
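For reference, a rough sketch of the simplest form of that override, assuming an older MKL release. To my knowledge the `MKL_DEBUG_CPU_TYPE=5` environment variable was only honored by MKL versions before 2020 Update 1; newer releases ignore it, and the links above describe the workarounds for those. The helper name `mkl_amd_override_env` is made up for illustration.

```python
import os


def mkl_amd_override_env(base_env=None):
    """Return a copy of the environment with the MKL CPU-type override set.

    Hypothetical helper. Assumption: the MKL in use is old enough to still
    read MKL_DEBUG_CPU_TYPE; newer releases are believed to ignore it.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_DEBUG_CPU_TYPE"] = "5"  # ask MKL for its AVX2 code path
    return env


if __name__ == "__main__":
    env = mkl_amd_override_env()
    print(env["MKL_DEBUG_CPU_TYPE"])
    # Pass env to the process that loads MKL, e.g.:
    # subprocess.run(["python", "train.py"], env=env)  # train.py is a placeholder
```

Benchmark with and without the variable on your own machine before relying on it, since whether it does anything depends on the MKL version that actually gets loaded.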
The BLAS/LAPACK ecosystem is a mess. I wish that Intel would just open source MKL and properly support AMD CPUs.