Yeah this is confusing for me: I'm non an expert in numpy * but I had assumed that it would do most of those things - vectorize, unroll, etc, either when compiled or through any backend it's using. I understand that numpy's routines are fixed and that mojo might have more flexibility, but for straight up matrix multiplication I'd be very surprised if it's really leaving that much performance on the table. Although I can appreciate that if it depends separately on what BLAS backend has been installed that is a barrier to getting default fast performance.
* For context I do have done some experience experimenting on the gcc/intel compiler options that are available for linear algebra, and even outside of BLAS, compiling with -o3 -ffast-math -funroll-loops etc does a lot of that, and for simple loops as in matrix vector multiplication, compilers can easily vectorize. I'm very curious if there is something I don't know about that will result in a speedup. See e.g. https://gist.github.com/rbitr/3b86154f78a0f0832e8bd171615236... for some basic playing around
I'm not sure where/how they'd be squeezing out more performance unless its better compilation/compatibility with Apple Silicon intrinsics.
Edit: ..Is Mojo using more than 1 core? I'm not sure I understand their syntax and if they are parallel constructs.
Edit2: Yeah Mojo seems to be parallelizing, so the comparison really isn't fair. The np.config posted elsewhere shows that OpenBLAS is only compiled with MAX_THREADS=3 support, and its not clear what their OPENBLAS_NUM_THREADS/OPENMP_NUM_THREADS was set to at runtime.
I'm not super familiar with Mac but I also notice that numpy here is using openblas64. I had thought the go-to was the Accelerate framework? Or is that part of it somehow? If so it would be interesting to see how that impacts performance. Of course it's all kind of an argument for something like Mojo that gives better performance out of the box. Also an argument for why Mojo would be way more interesting if it was open source.
Just whatever you get by default with pip install numpy... Changing the benchmark to run with a 1024x1024x1024 matrix instead of a 128x128x128 does speed up numpy significantly though
Python 119.189 GFLOPS
Naive: 6.275 GFLOPS 0.05x faster than Python
Vectorized: 22.259 GFLOPS 0.19x faster than Python
Parallelized: 50.258 GFLOPS 0.42x faster than Python
Tiled: 59.692 GFLOPS 0.50x faster than Python
Unrolled: 62.165 GFLOPS 0.52x faster than Python
Accumulated: 565.240 GFLOPS 4.74x faster than Python
If you are looking for improved performance, you will always go with NumPy + vectorization. That's what is important. So I don't know what is the argument here, am I missing something?
* For context I do have done some experience experimenting on the gcc/intel compiler options that are available for linear algebra, and even outside of BLAS, compiling with -o3 -ffast-math -funroll-loops etc does a lot of that, and for simple loops as in matrix vector multiplication, compilers can easily vectorize. I'm very curious if there is something I don't know about that will result in a speedup. See e.g. https://gist.github.com/rbitr/3b86154f78a0f0832e8bd171615236... for some basic playing around