|
|
|
|
|
by chessgecko
552 days ago
|
|
I remember reading that it’s too hard to get good memory bandwidth/l2 utilization in the fancy algorithms, you need to read contiguous blocks and be able to use them repeatedly. But I also haven’t looked at the gpu blas implementations directly. |
|