> OpenBLAS

OpenBLAS is incompatible with application threads. Most Linux distributions provide a multi-threaded OpenBLAS that burns in a fire if you use it in multi-threaded applications. Even though OpenBLAS' performance is great, I'd be careful about giving a general recommendation for people to rely on OpenBLAS. As with this MKL example, you have to be aware of its threading issues, read the documentation, and compile it with the right flags (for a multi-threaded application: single-threaded, but with locking).

> it's worth noting that OpenBLAS is as fast as MKL

This depends highly on the application. For example, MKL provides batch GEMM, which is used by libraries like PyTorch. So if you use PyTorch for machine learning, performance is still much better with MKL. Of course, that only holds if you do not have an AMD CPU. If you have an AMD CPU, you have to override MKL's Intel CPU detection if you do not want abysmal performance:

https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
https://www.agner.org/optimize/blog/read.php?i=49

The BLAS/LAPACK ecosystem is a mess. I wish that Intel would just open source MKL and properly support AMD CPUs.
Can you explain what you mean by this? Are you saying there's a correctness issue here? I only recall running into issues with MPI, where you (typically) run one MPI rank (process) per CPU core. If you then combine that with a multi-threaded BLAS library, you suddenly have N^2 BLAS threads fighting over the CPUs, and performance goes down the drain. The solution, like you say, is to use a single-threaded OpenBLAS, or the OpenMP build of OpenBLAS with OMP_NUM_THREADS=1.
I guess with threads you'll have the same issue if you launch N CPU-bound threads that all call BLAS, resulting in the same N^2 issue as you see with MPI.
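
To make the N^2 arithmetic concrete, here is a minimal Python sketch. The helper names and the 16-core machine are hypothetical, chosen only for illustration. OMP_NUM_THREADS is the variable mentioned above; OPENBLAS_NUM_THREADS is the corresponding variable for non-OpenMP builds of OpenBLAS. The sketch assumes these variables are set before the BLAS library is loaded, since they are typically read at library initialization.

```python
import os


def oversubscription(workers: int, blas_threads_per_worker: int, cores: int) -> float:
    """Ratio of runnable BLAS threads to cores (1.0 means one thread per core)."""
    return workers * blas_threads_per_worker / cores


def pin_blas_to_one_thread() -> None:
    """Ask the BLAS library to use a single thread per caller.

    Assumption: this runs before the BLAS library is loaded (for example,
    before importing NumPy), otherwise the setting may be ignored.
    """
    os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP builds
    os.environ["OPENBLAS_NUM_THREADS"] = "1"  # pthread builds of OpenBLAS


cores = 16  # hypothetical machine

# One worker (MPI rank or application thread) per core, each calling a BLAS
# that itself spawns one thread per core: N * N threads on N cores.
print(oversubscription(cores, cores, cores))  # 16.0

# With a single BLAS thread per worker, there is one runnable thread per core.
pin_blas_to_one_thread()
print(oversubscription(cores, 1, cores))  # 1.0
```

This only addresses the oversubscription problem described in this reply; it says nothing about whether a given OpenBLAS build is safe to call from multiple application threads.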