Hacker News new | ask | show | jobs
by dsharlet 894 days ago
This is really overstating how hard it is to compete with matrix multiply libraries. The main reason those libraries are so big and have had so much work invested in them is their generality: they're reasonably fast for almost any kind of inputs.

If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc... if you are using a language like C++. If you are using python, you have no chance.

It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing, you just have to be a little careful about how you write the code.

4 comments

If you need a really fast compiled inner loop then you can often implement it in Python using Numba. Using that you can easily implement something like sparse matrix multiplication in Python.
> If you are using python, you have no chance.

Of course you do. Every special-case multiplication algorithm you might need already has an optimized implementation that you can just `pip install`, and move on with what you're actually working on.

The whole scientific computing world runs on Python. Straightforward numerics code using NumPy tends to murder C/C++ code in regard to performance, unless that code is written by people who make a living hand-optimizing computational routines.

> The whole scientific computing world runs on Python

If you ignore the majority of scientific code running on supercomputers doing most of science in C++ and Fortran.

Even in areas where python is used, the majority of the compute runs on C/C++/Fortran, with a little python as glue.

If you think numpy (written in c/c++) murders c/c++ code, you should learn about HPC, where really high performance happens. They don't use numpy.

> This is really overstating how hard it is to compete with matrix multiply libraries.

I'll file this under "talk is cheap". :) I tried it last year and got within 50% of BLAS. Getting above that is tons of work. Which you have to repeat for every processor model, NUMA, and every combination of matrix type (long thin, short wide, etc).

This gets to 90% of BLAS: https://github.com/dsharlet/array/blob/38f8ce332fc4e26af0832...

The less involved versions still get ~70%.

But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.

I don't believe you. OpenBLAS is multithreaded but the code you posted is single-threaded. The inner kernel isn't very well optimized either. So, no way.
I should have mentioned somewhere, I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyways.

As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS. There's disassembly in the comments of that file to show what code it generates, I'd love to know what you think is lacking! The only difference between the one I linked and this is prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...

So I actually tested your code: https://gist.github.com/bjourne/c2d0db48b2e50aaadf884e4450c6...

On my machine single-threaded OpenBLAS (called via NumPy) multiplies two single precision 4096x4096 matrices in 0.95 seconds. Your code takes over 30 seconds when compiled with clang++. And yes I used -O3, -march=native, and all that jazz. Btw, your code crashes g++ which doesn't necessarily mean that it is incorret, but it indicates that the code may be difficult for the compiler to optimize. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...) run in single-threaded mode takes 0.89 seconds. Which actually beats OpenBLAS, but OpenBLAS retakes the lead for larger arrays when multi-threading is added. You can look at my code for how to write a decent inner kernel. Writing it in pure C without intrinsics and hoping that the compiler will optimize it definitely will not work.

It also is not true that "Parallelism would be easy to add". Unless your algorithm is designed from the start to exploit multi-threading, attempting to bolt it on later will not yield good performance.

The makefile asks for -O2 with clang. I find that -O3 almost never helps in clang. (In gcc it does.)

Here's what I see:

   $ clang++ --version
   clang version 18.0.0

   $ time make bin/matrix
   mkdir -p bin
   clang++ -I../../include -I../ -o bin/matrix matrix.cpp  -O2 -march=native -ffast-math -fstrict-aliasing -fno-exceptions -DNDEBUG -DBLAS  -std=c++14 -Wall -lstdc++ -lm -lblas
   1.25user 0.29system 0:02.74elapsed 56%CPU (0avgtext+0avgdata 126996maxresident)k
   159608inputs+120outputs (961major+25661minor)pagefaults 0swaps

   $ bin/matrix
   ...
   reduce_tiles_z_order time: 3.86099 ms, 117.323 GFLOP/s
   blas time: 0.533486 ms, 849.103 GFLOP/s

   $ OMP_NUM_THREADS=1 bin/matrix
   ...
   reduce_tiles_z_order time: 3.89488 ms, 116.303 GFLOP/s
   blas time: 3.49714 ms, 129.53 GFLOP/s
My inner loop in perf: https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b... BLAS inner loop in perf (a chunk of it, it is unrolled massively): https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...

Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.

Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.

You need to have enough experience to be able to be a little careful though. This is generally true for most languages, loop unrolling works even better in Python for example, but many Python programmers aren't even aware of this possibility.