Hacker News new | ask | show | jobs
by costrouc 2948 days ago
Matrix multiplication involves moving across one matrix in column order and the other matrix in row order. So it turns out that both row or column ordering make no difference. I think that matrix multiplication is one of the best examples of a deceptivly simple problem. It shows how far ALL code is from peak performance. We can only strive to be within an order or two from peak performance.
2 comments

This trick does work. If the matrices are in row-major order, you transpose B in memory and then compute A * (B^T)^T. This multiplication reads both matrices in row order.

However, while this does improve performance over the naive algorithm, it's still not as good as a tiling algorithm.

I've found where that causes problems is when on of the matrix dimensions is not a multiple of the cache line size. It's common on gpus to use more elements then there are in the dimension. Nvidia calls this the leading dimension, and it must be greater than or equal to the Matrix dimension. If this is the case, the transpose trick doesn't quite work anymore.
I took a different lesson from learning to implement matrix multiply: you can get a small amount of code within a few percent of peak performance if you try. So, the trick is to keep the 99% of the code that does 10% of the work within an order of magnitude of peak performance, and then get the rest within a factor of two.

That gets the system within a factor of three, which is still embarrassing, but that’s what profiling tools are for. :-)