by mcabbott 1527 days ago
On the GPU, operations which look like matrix multiplication are indeed quite slow. As you say, there are many tricks, and Tullio doesn't (right now) know them. For operations which are more like broadcasting, it does much better.

On the CPU, the situation is much better. With some help from LoopVectorization.jl (which optimises micro-kernels) it will often beat OpenBLAS at matrix multiplication. The best-case scenario is an operation which would otherwise be permutedims plus matrix multiplication, for which it will often be several times faster, by fusing these.
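
To make "fusing" concrete, here is a rough sketch in plain Python (not Julia, and not how Tullio is actually implemented) of the difference between permuting first and then multiplying, versus writing one loop nest for an index expression like C[i,j] = sum over k of A[k,i] * B[k,j]. The function names are made up for illustration.

```python
# Illustrative sketch only: nested Python lists, no claim about Tullio's internals.

def transpose(a):
    """Materialise a permuted copy (the analogue of permutedims)."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Ordinary matrix product of nested lists."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def fused_transpose_matmul(a, b):
    """C[i][j] = sum_k a[k][i] * b[k][j], reading a in place (no transposed copy)."""
    k_len, n, p = len(a), len(a[0]), len(b[0])
    c = [[0] * p for _ in range(n)]
    for k in range(k_len):
        for i in range(n):
            aki = a[k][i]          # index a as [k][i] instead of copying it
            for j in range(p):
                c[i][j] += aki * b[k][j]
    return c

A = [[1, 2], [3, 4], [5, 6]]            # 3x2
B = [[1, 0, 2], [0, 1, 3], [4, 5, 6]]   # 3x3

two_step = matmul(transpose(A), B)      # allocates the intermediate transpose
fused = fused_transpose_matmul(A, B)    # one pass, no intermediate
assert two_step == fused
```

The two-step version allocates and writes the whole transposed array before any multiplication happens; the fused version skips that extra pass over memory, which is where the saving in the permutedims-plus-multiplication case comes from.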

The notation above is shared by some other packages. TensorOperations.jl always decomposes to known kernels (including on the GPU), and OMEinsum.jl usually does so (with a fallback to loops); both are more like einsum. TensorCast.jl is more like einops, just notation for writing reshape/permute/slice operations.