by fdej 2948 days ago
This trick does work. If the matrices are stored in row-major order, you transpose B in memory and then compute A * (B^T)^T, so each entry of the product is the dot product of a row of A with a row of B^T. The multiplication then reads both matrices in row order.
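A minimal sketch of the trick, assuming matrices are stored as flat row-major Python lists (the function names `transpose` and `matmul_transposed` are made up for illustration). Pure Python won't show the cache effect; the point is only the access pattern: the inner loop walks both `a` and `bt` with stride 1.

```python
def transpose(b, rows, cols):
    """Return the transpose of a row-major rows x cols matrix as a flat list."""
    bt = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            bt[j * rows + i] = b[i * cols + j]
    return bt


def matmul_transposed(a, b, n, k, m):
    """Compute C = A * B for A (n x k) and B (k x m), all flat row-major."""
    bt = transpose(b, k, m)  # bt is m x k, so column j of B is row j of bt
    c = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # Both reads advance by one element per iteration of p.
                acc += a[i * k + p] * bt[j * k + p]
            c[i * m + j] = acc
    return c
```

The transpose itself costs one extra pass over B, which is small next to the n * k * m multiply-adds for matrices of any real size.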

However, while this does improve performance over the naive algorithm, it's still not as good as a tiling algorithm.
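For comparison, a rough sketch of a tiled (blocked) multiply on the same flat row-major layout. The name `matmul_tiled` and the default `tile` size are assumptions for illustration; in real code the tile size would be tuned so that the working tiles fit in cache.

```python
def matmul_tiled(a, b, n, k, m, tile=2):
    """Compute C = A * B for A (n x k) and B (k x m), flat row-major, in tiles."""
    c = [0.0] * (n * m)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Multiply one tile of A by one tile of B and accumulate
                # into the matching tile of C. min() handles edge tiles
                # when a dimension is not a multiple of the tile size.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i * k + p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i * m + j] += a_ip * b[p * m + j]
    return c
```

Each tile of A and B gets reused for a whole tile of C while it is still hot in cache, which is where the advantage over the plain transposed version comes from.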

1 comment

I've found that this causes problems when one of the matrix dimensions is not a multiple of the cache line size. It's common on GPUs to allocate more elements than there are in the dimension. Nvidia calls this the leading dimension, and it must be greater than or equal to the matrix dimension. If this is the case, the transpose trick doesn't quite work anymore.
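To make the padding concrete, here is a small sketch assuming a row-major layout where the leading dimension `ld` is the stride between consecutive rows (Nvidia's BLAS libraries are column-major, where it is the stride between columns; the idea is the same). The helper names `pad_rows` and `transpose_padded` are hypothetical. Element (i, j) lives at `i * ld + j`, not `i * cols + j`, so a transpose has to carry separate strides for source and destination.

```python
def pad_rows(dense, rows, cols, ld, fill=0.0):
    """Copy a dense row-major matrix into storage with row stride ld >= cols."""
    assert ld >= cols
    out = [fill] * (rows * ld)
    for i in range(rows):
        for j in range(cols):
            out[i * ld + j] = dense[i * cols + j]
    return out


def transpose_padded(src, rows, cols, ld_src, ld_dst):
    """Transpose a padded rows x cols matrix into padded cols x rows storage."""
    assert ld_src >= cols and ld_dst >= rows
    dst = [0.0] * (cols * ld_dst)
    for i in range(rows):
        for j in range(cols):
            # Source and destination each use their own stride.
            dst[j * ld_dst + i] = src[i * ld_src + j]
    return dst
```

For example, a 2 x 3 matrix stored with `ld = 4` takes 8 elements instead of 6, and its transpose is 3 x 2 with its own leading dimension, so the padded buffer cannot simply be reinterpreted in place.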