| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by stabbles 1911 days ago

I think it's more interesting to see what people do with the language instead of focusing on microbenchmarks. There's for instance this great package https://github.com/JuliaSIMD/LoopVectorization.jl which exports a simple macro `@avx` which you can stick to loops to vectorize them in ways better than the compiler (=LLVM). It's quite remarkable you can implement this in the language as a package as opposed to having LLVM improve or the julia compiler team figure this out.

See the docs which kinda read like blog posts: https://juliasimd.github.io/LoopVectorization.jl/stable/

And then replacing the matmul.jl with the following:

    @avx for i = 1:m, j = 1:p
        z = 0.0
        for k = 1:n
            z += a[i, k] * b[k, j]
        end
        out[i, j] = z
    end

I get a 4x speedup from 2.72s to 0.63s. And with @avxt (threaded) using 8 threads it goes town to 0.082s on my amd ryzen cpu. (So this is not dispatching to MKL/OpenBLAS/etc). Doing the same in native Python takes 403.781s on this system -- haven't tried the others.