by gyrovagueGeist 977 days ago
Even OpenBLAS (the default iiuc) does all of that and more to optimize for different levels of the cache hierarchy: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
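
For anyone unfamiliar with what "optimizing for the cache hierarchy" means here, a minimal sketch of plain loop tiling follows. It is not what OpenBLAS actually does (the linked paper adds operand packing and hand-tuned micro-kernels on top), and in pure Python the interpreter overhead hides any cache benefit, so it only illustrates the loop structure. The function name and the `block` parameter are made up for illustration.

```python
def matmul_blocked(a, b, block=64):
    """Multiply a (n x k) by b (k x m), given as lists of lists, tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer three loops walk over tiles; `block` is a hypothetical tile size
    # that a real library would tune to the cache sizes of the target CPU.
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Inner three loops do an ordinary matmul restricted to one
                # tile, so the same small pieces of a, b and c are reused.
                for i in range(i0, min(i0 + block, n)):
                    row_c, row_a = c[i], a[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

The result is the same as a naive triple loop; only the order in which the elements are visited changes.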

I'm not sure where/how they'd be squeezing out more performance unless it's better compilation/compatibility with Apple Silicon intrinsics.

Edit: Is Mojo using more than one core? I'm not sure I understand their syntax, or whether those are parallel constructs.

Edit2: Yeah, Mojo seems to be parallelizing, so the comparison really isn't fair. The np.config posted elsewhere shows that OpenBLAS is only compiled with MAX_THREADS=3 support, and it's not clear what their OPENBLAS_NUM_THREADS/OMP_NUM_THREADS was set to at runtime.
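
A rough sketch of how one could make the NumPy side of such a comparison explicit about threading. It pins the thread count through environment variables in a fresh interpreter, since OpenBLAS reads them when it is loaded. `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` (the standard OpenMP spelling) are real variables; the helper names, the 1024x1024 size and the single-run timing are my own assumptions, not the benchmark from the post. The child process needs NumPy installed.

```python
import os
import subprocess
import sys

# Variables that OpenBLAS / OpenMP runtimes read at startup.
THREAD_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")


def pinned_env(num_threads, base=None):
    """Return a copy of the environment with BLAS thread counts pinned."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = str(num_threads)
    return env


# Hypothetical micro-benchmark, run in a child interpreter so the settings
# are already in place before NumPy loads its BLAS library.
BENCH = """
import time, numpy as np
a = np.random.rand(1024, 1024); b = np.random.rand(1024, 1024)
a @ b  # warm-up
t0 = time.perf_counter(); a @ b
print(time.perf_counter() - t0)
"""


def time_matmul(num_threads):
    """Time one float64 matmul (in seconds) with the given thread count."""
    out = subprocess.run(
        [sys.executable, "-c", BENCH],
        env=pinned_env(num_threads),
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())
```

Calling `time_matmul(1)` and then `time_matmul(3)` would at least show how much of any gap is down to thread count. A build compiled with MAX_THREADS=3 should not use more than three threads however high the variable is set.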

1 comment

I'm not super familiar with Mac, but I also notice that numpy here is using openblas64. I had thought the go-to was the Accelerate framework? Or is that part of it somehow? If so, it would be interesting to see how that impacts performance. Of course, it's all kind of an argument for something like Mojo that gives better performance out of the box. It's also an argument for why Mojo would be way more interesting if it were open source.