    julia> using LinearAlgebra
    julia> BLAS.vendor()
    :openblas64
    julia> BLAS.set_num_threads(1)
    julia> peakflops()
    3.9023447970402664e10

    julia> using LinearAlgebra
    julia> BLAS.vendor()
    :mkl
    julia> BLAS.set_num_threads(1)
    julia> peakflops()
    4.8113846984735275e10

The plot thickens. As I reported elsewhere in the thread, MKL selected the slow code paths on my machine unless I overrode the mkl_serv_intel_cpu_true function to always return true. However, that was with PyTorch.
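For anyone who wants to try the override, here is a minimal sketch of the kind of shim I mean. It assumes MKL is linked dynamically, so that a preloaded library can shadow the symbol; the file and library names (fakeintel.c, libfakeintel.so) are just placeholders I picked, and this may stop working in MKL versions that no longer call this function.

```c
/* fakeintel.c (hypothetical file name)
 *
 * Shadows MKL's vendor check so that it always reports an Intel CPU.
 * Assumption: MKL resolves mkl_serv_intel_cpu_true dynamically, so a
 * definition loaded earlier (via LD_PRELOAD) takes precedence.
 *
 * Build and use (Linux, gcc assumed):
 *   gcc -shared -fPIC -o libfakeintel.so fakeintel.c
 *   LD_PRELOAD=./libfakeintel.so python my_script.py
 */
int mkl_serv_intel_cpu_true(void) {
    return 1; /* nonzero means "true": pretend the CPU is an Intel part */
}
```

This only changes what the vendor check reports. Which kernels MKL dispatches to afterwards is still up to MKL, so it is worth profiling again after preloading to see what actually runs.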

I have now also compiled the ACE DGEMM benchmark and linked against MKL iomp:

    $ ./mt-dgemm 1000 | grep GFLOP
    GFLOP/s rate:         69.124168 GF/s

The most-used function is:

    mt-dgemm  libmkl_def.so       [.] mkl_blas_def_dgemm_kernel_zen

So it is clearly using a GEMM kernel written for Zen, going by the _zen suffix. Now I wonder what differs between PyTorch and this simple benchmark that makes PyTorch end up on a slow SSE code path.