Hacker News new | ask | show | jobs
by chessgecko 552 days ago
I remember reading that it’s too hard to get good memory bandwidth/l2 utilization in the fancy algorithms, you need to read contiguous blocks and be able to use them repeatedly. But I also haven’t looked at the gpu blas implementations directly.