by gnufx 2956 days ago
It's a factor of three for the large-matrix serial case on KNL (see the OpenBLAS issue on KNL), whereas you might expect a factor of two by analogy with AVX/AVX2.
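
To make the "factor of two" expectation concrete, here is a rough back-of-the-envelope sketch. It assumes peak double-precision throughput per core scales with vector width and FMA support, and that each core has two vector units in all three cases. The function name and parameters are purely illustrative and not taken from any library.

```python
def peak_dp_flops_per_cycle(vector_bits, fma, units=2):
    """Theoretical peak double-precision flops per cycle per core.

    Assumption: `units` vector pipelines per core, each handling
    vector_bits / 64 doubles, with an FMA counting as two flops.
    """
    lanes = vector_bits // 64          # doubles per vector register
    flops_per_lane = 2 if fma else 1   # FMA = multiply + add in one op
    return lanes * flops_per_lane * units

avx = peak_dp_flops_per_cycle(256, fma=False)    # 256-bit, no FMA
avx2 = peak_dp_flops_per_cycle(256, fma=True)    # 256-bit with FMA
avx512 = peak_dp_flops_per_cycle(512, fma=True)  # 512-bit with FMA

print(avx, avx2, avx512)
print("avx -> avx2:", avx2 / avx)
print("avx2 -> avx512:", avx512 / avx2)
```

Under these assumptions each step doubles the theoretical peak, so a measured factor of three is more than vector width alone would predict.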

For large-matrix BLAS on AVX-512 (and maybe other x86_64 targets, for which BLIS now uses dynamic dispatch), use BLIS. BLIS also provides a non-BLAS interface. For small matrix multiplication, use libxsmm, of course.

Remember that the world isn't all amd64/x86_64. On other architectures, where MKL doesn't run at all, BLIS is infinitely faster than MKL, and it's probably faster even on Bulldozer/Zen. (I haven't compared on Bulldozer recently, and don't have Zen.)

1 comment

I was looking at [0] for that number. You're right, it's closer to 3 than 4; I must have rounded ~12k down to 10k and 37k up to 40k. I could imagine some other factors speeding it up further, as well. AVX2 was missing a number of instructions that have been filled in for AVX512, which could play a role.
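
For the record, the rounding effect is easy to check. This is just the arithmetic on the approximate figures quoted above, assuming both are in the same units; the variable names are illustrative.

```python
# Approximate figures as quoted in the comment (same units assumed).
slower = 12_000
faster = 37_000

# Ratio from the figures as read, versus the ratio after rounding
# the small one down to 10k and the large one up to 40k.
actual_ratio = faster / slower
rounded_ratio = 40_000 / 10_000

print(f"as read: {actual_ratio:.2f}x, after rounding: {rounded_ratio:.0f}x")
```

Rounding the two figures in opposite directions is enough to turn roughly 3x into 4x.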

Thanks for the heads-up RE: BLIS, I'd forgotten about it; it's probably the best option, especially considering its open source status.

[0] https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-3...