by steppi
804 days ago
Check out The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí [0]. Chapter 5 walks through how to write an optimized GEMM. It involves clever use of block multiplication, with block sizes chosen for optimal cache behavior on specific chips. Modern compilers still aren't able to do this kind of thing on their own. I've spent a little time debugging things in scipy.linalg by swapping out OpenBLAS for reference BLAS, and found that the slowdown from using reference BLAS is typically at least an order of magnitude.

[0] https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
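To make the block multiplication idea concrete, here's a rough sketch of the loop structure in plain Python. This is not the book's algorithm or what OpenBLAS actually does (real implementations also pack the blocks into contiguous buffers and use hand-tuned kernels); the function names and the `block` parameter are my own, and the default of 64 is an arbitrary placeholder, not a tuned value. Pure Python won't show any cache speedup either; the point is only how the work gets tiled.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=64):
    """Same product, computed tile by tile.

    `block` is a hypothetical tuning knob. In a real GEMM it would be
    chosen per chip so the tiles being worked on fit in cache.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles of C (i0, j0) and the shared dimension (p0).
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Inner loops multiply one tile of A by one tile of B and
                # accumulate the partial result into the matching tile of C.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

In a compiled language the tiled version keeps reusing the same small pieces of A, B, and C while they are still in cache, which is where the win comes from. Picking the tile sizes for a given chip is the part that takes hand tuning.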