by gyrovagueGeist
977 days ago
Even OpenBLAS (the default iiuc) does all of that and more to optimize for different levels of the cache hierarchy: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

I'm not sure where/how they'd be squeezing out more performance, unless it's better compilation/compatibility with Apple Silicon intrinsics.

Edit: Is Mojo using more than 1 core? I'm not sure I understand their syntax, or whether those are parallel constructs.

Edit2: Yeah, Mojo seems to be parallelizing, so the comparison really isn't fair. The NumPy config posted elsewhere shows that OpenBLAS is only compiled with MAX_THREADS=3 support, and it's not clear what their OPENBLAS_NUM_THREADS/OMP_NUM_THREADS was set to at runtime.
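For anyone wanting to check the runtime side of this on their own machine, here's a minimal sketch that reports which thread-count environment variable OpenBLAS would pick up. The helper names are made up for illustration, and the priority order (OPENBLAS_NUM_THREADS, then GOTO_NUM_THREADS, then OMP_NUM_THREADS) is my reading of the OpenBLAS README, so treat it as an assumption. It only looks at the environment; it doesn't query the loaded library, and a build capped at MAX_THREADS=3 would presumably still be limited by that cap.

```python
import os

# Assumed priority order, per the OpenBLAS README:
# OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS", "OMP_NUM_THREADS")


def blas_thread_env(environ=None):
    """Return each thread-count variable and its value (None if unset)."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in BLAS_THREAD_VARS}


def effective_thread_hint(environ=None):
    """Return (name, value) of the highest-priority variable that is set, or None."""
    for name, value in blas_thread_env(environ).items():
        if value is not None:
            return name, value
    return None


if __name__ == "__main__":
    hint = effective_thread_hint()
    if hint is None:
        print("No thread-count variable set; OpenBLAS will choose its own default.")
    else:
        print(f"{hint[0]}={hint[1]} takes effect (subject to the build's thread cap).")
```

For a single-core comparison you'd want something like OPENBLAS_NUM_THREADS=1 exported before Python starts, since as far as I know the value is read when the library is loaded.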