by stabbles
1911 days ago
I think it's more interesting to see what people do with the language than to focus on microbenchmarks. For instance, there's this great package, https://github.com/JuliaSIMD/LoopVectorization.jl, which exports a simple macro `@avx` that you can stick onto loops to vectorize them better than the compiler (= LLVM) does. It's quite remarkable that you can implement this in the language as a package, as opposed to waiting for LLVM to improve or for the Julia compiler team to figure it out. See the docs, which kind of read like blog posts: https://juliasimd.github.io/LoopVectorization.jl/stable/

And then replacing the loop in matmul.jl with the following:

  @avx for i = 1:m, j = 1:p
      z = 0.0
      for k = 1:n
          z += a[i, k] * b[k, j]
      end
      out[i, j] = z
  end

I get a 4x speedup, from 2.72s to 0.63s. With `@avxt` (threaded) on 8 threads it goes down to 0.082s on my AMD Ryzen CPU. (Note that this is not dispatching to MKL/OpenBLAS/etc.) Doing the same in native Python takes 403.781s on this system; I haven't tried the others.
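
For anyone who wants to try this, here is a minimal self-contained sketch of the loop above wrapped in a function. The function name `matmul_turbo!` and the matrix sizes are made up for illustration and are not taken from the benchmark's matmul.jl. It also assumes a recent LoopVectorization release, where the macro is spelled `@turbo` (the newer name for `@avx`; the threaded `@avxt` became `@tturbo`).

```julia
using LoopVectorization  # assumed to be installed via Pkg.add("LoopVectorization")

# Hypothetical wrapper around the loop from the comment; writes a * b into out.
function matmul_turbo!(out, a, b)
    m, n = size(a)
    p = size(b, 2)
    # @turbo is the current spelling of @avx; swap in @tturbo for the threaded version.
    @turbo for i = 1:m, j = 1:p
        z = 0.0
        for k = 1:n
            z += a[i, k] * b[k, j]
        end
        out[i, j] = z
    end
    return out
end

# Illustrative sizes only; check the result against the built-in (BLAS) product.
a = rand(200, 300)
b = rand(300, 100)
out = matmul_turbo!(zeros(200, 100), a, b)
@assert out ≈ a * b
```

Timings will of course depend on the CPU and the matrix sizes, so don't expect to reproduce the numbers above exactly.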