by jiggawatts
535 days ago
Merged two days ago!? That’s about half a decade after they should have done this foundational work! I guess it’s better late than never, but in this case a timely implementation was worth about a trillion dollars… maybe two.
|
https://salykova.github.io/matmul-cpu
Coincidentally, the Intel MKL also outperforms OpenBLAS, so it is well known that there is room for improvement. That said, I have a GEMV implementation that outperforms both the Intel MKL and OpenBLAS in my tests on Zen 3:
https://github.com/ryao/llama3.c/blob/master/run.c#L429
That is, unless you shoehorn GEMV into the Intel MKL's batched GEMM function, in which case the MKL outperforms my code when there is locality. When there is no locality, my code still runs faster.
I suspect that if/when this reaches the authors of the established amd64 BLAS implementations, they will adopt my trick to make their non-batched GEMV implementations run fast too. In particular, I calculate the dot products for 8 rows in parallel, followed by 8 parallel horizontal additions. I have not seen the technique of 8 parallel horizontal additions mentioned anywhere, so I might be the first to have done it.
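
For anyone who wants to see the shape of the idea, here is a minimal sketch of "8 row dot products in parallel, then 8 horizontal additions at once". It is an illustration only, not the code from the linked run.c. The function name `gemv_rows8` is hypothetical, the matrix is assumed to be row-major, and both dimensions are assumed to be multiples of 8. The AVX path is one plausible way to do the reduction with `_mm256_hadd_ps` and `_mm256_permute2f128_ps`; a scalar fallback is included so it compiles without AVX.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* y = A * x for a row-major rows x cols matrix A.
   Assumption for brevity: rows and cols are both multiples of 8. */
void gemv_rows8(const float *A, const float *x, float *y,
                size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r += 8) {
#if defined(__AVX__)
        /* One 8-lane accumulator per row: 8 dot products in flight. */
        __m256 acc[8];
        for (int k = 0; k < 8; k++)
            acc[k] = _mm256_setzero_ps();

        for (size_t c = 0; c < cols; c += 8) {
            __m256 xv = _mm256_loadu_ps(x + c);   /* x chunk shared by all 8 rows */
            for (int k = 0; k < 8; k++) {
                __m256 av = _mm256_loadu_ps(A + (r + k) * cols + c);
                acc[k] = _mm256_add_ps(acc[k], _mm256_mul_ps(av, xv));
            }
        }

        /* Reduce all 8 accumulators together instead of one at a time.
           Two rounds of hadd leave, per 128-bit half, the partial sum
           of four rows. */
        __m256 s01 = _mm256_hadd_ps(acc[0], acc[1]);
        __m256 s23 = _mm256_hadd_ps(acc[2], acc[3]);
        __m256 s45 = _mm256_hadd_ps(acc[4], acc[5]);
        __m256 s67 = _mm256_hadd_ps(acc[6], acc[7]);
        __m256 s0123 = _mm256_hadd_ps(s01, s23);
        __m256 s4567 = _mm256_hadd_ps(s45, s67);

        /* Gather the low halves and the high halves, then add them:
           lane k of the result is the full dot product of row r + k. */
        __m256 lo = _mm256_permute2f128_ps(s0123, s4567, 0x20);
        __m256 hi = _mm256_permute2f128_ps(s0123, s4567, 0x31);
        _mm256_storeu_ps(y + r, _mm256_add_ps(lo, hi));
#else
        /* Scalar fallback with the same result, one row at a time. */
        for (int k = 0; k < 8; k++) {
            float sum = 0.0f;
            for (size_t c = 0; c < cols; c++)
                sum += A[(r + k) * cols + c] * x[c];
            y[r + k] = sum;
        }
#endif
    }
}
```

The point of reducing 8 accumulators together is that the final store writes 8 finished outputs with 8 shuffle/add instructions in total, rather than running a separate multi-step horizontal sum for each row. Build with something like `-O2 -mavx` to get the vector path.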