Hacker News
by hellepardo 2956 days ago
Armadillo is built on any BLAS you might like to use. (Eigen is a hand-implemented replacement for BLAS, by the way.) So I, too, am interested in which BLAS implementation is used here. OpenBLAS vs. the reference LAPACK/BLAS (or ATLAS) can make quite a large difference.
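For anyone who wants to check how much the backend matters on their own machine, here is a minimal sketch of the usual gemm throughput measurement, counting 2*n^3 floating-point operations per n x n multiply. The names `naive_gemm` and `gemm_gflops` are hypothetical helpers, and the pure-Python triple loop is only a stand-in: pass in a wrapper around whichever BLAS you actually link against to get a meaningful number.

```python
import time


def naive_gemm(a, b):
    """Reference triple-loop multiply on lists of lists (stand-in for a BLAS gemm)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for p in range(k):
            aip = a[i][p]
            row_b = b[p]
            for j in range(m):
                row_c[j] += aip * row_b[j]
    return c


def gemm_gflops(gemm, n, repeats=3):
    """Time gemm(a, b) on n x n inputs and return the best run in GFLOP/s."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        gemm(a, b)
        best = min(best, time.perf_counter() - start)
    # One n x n gemm performs n^3 multiplies and n^3 adds.
    return 2.0 * n ** 3 / best / 1e9


# Small size so the pure-Python stand-in finishes quickly.
print(f"naive gemm: {gemm_gflops(naive_gemm, 32):.4f} GFLOP/s")
```

Running the same harness with two different backends and dividing the results gives the kind of ratio discussed in this thread.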

Eigen, Armadillo, Blaze, and ETL all have their own replacement implementations for BLAS but can be linked against any BLAS implementation. By the way, MKL supports AVX512, while OpenBLAS does not as of yet. Benchmarks show a factor of 4 between the two for gemm.
It's a factor of three for the large-matrix serial case on KNL (see the OpenBLAS issue on KNL), whereas you might expect a factor of two by analogy with AVX/AVX2.
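The factor-of-two expectation follows from vector width alone. A back-of-the-envelope sketch, assuming the same number of FMA units per core in both cases (two here, purely for illustration); `peak_dp_flops_per_cycle` is a hypothetical helper, not part of any library:

```python
def peak_dp_flops_per_cycle(vector_bits, fma_units=2):
    """Peak double-precision flops per cycle per core for a given vector width.

    fma_units=2 is an assumption for illustration, held equal for both ISAs.
    """
    lanes = vector_bits // 64      # 64-bit doubles per vector register
    return lanes * 2 * fma_units   # each FMA counts as 2 flops per lane


avx2 = peak_dp_flops_per_cycle(256)    # 4 lanes -> 16
avx512 = peak_dp_flops_per_cycle(512)  # 8 lanes -> 32
print(avx512 / avx2)                   # 2.0
```

Anything beyond 2x in a measured comparison therefore has to come from something other than register width.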

For large-matrix BLAS on AVX512 (and maybe other x86_64 targets, which BLIS now dispatches dynamically), use BLIS. BLIS also provides a non-BLAS interface. For small matrix multiplication, use libxsmm, of course.

Remember that the world isn't all amd64/x86_64; on other architectures BLIS is infinitely faster than MKL, and it's probably faster even on Bulldozer/Zen. (I haven't compared on Bulldozer recently, and don't have Zen.)

I was looking at [0] for that number. You're right, it's closer to 3 than 4; I must have rounded ~12k down to 10k and 37k up to 40k. I could imagine some other factors speeding it up further, as well. There were a number of missing instructions in AVX2 that they've filled in for AVX512 which could play a role.
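Spelling out that rounding, using only the approximate figures quoted above (the variable names are just labels for this sketch):

```python
openblas = 12_000  # ~12k, as quoted above
mkl = 37_000       # ~37k, as quoted above

actual = mkl / openblas     # ratio of the quoted figures
rounded = 40_000 / 10_000   # ratio after rounding 37k up and 12k down

print(f"{actual:.2f}")   # 3.08
print(f"{rounded:.2f}")  # 4.00
```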

Thanks for the heads-up RE: BLIS, I'd forgotten about it; it's probably the best option, especially considering its open source status.

[0] https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-3...