|
|
|
|
|
by saagarjha
379 days ago
|
|
Unfortunately the issue (alluded to in the blog post you linked) is that transposes do absolutely no work but memory loads. Sure, they test that you can swizzle your accesses, but modern accelerators are all about pipelining and feeding matrix multiply units, which is considerably harder than loading from memory as fast as possible. Actually, even the Mojo post barely beats CUDA for most of its kernels, because you can hit memory bandwidth for transpose on the latest hardware using techniques from 5-10 years ago. This is definitely not true for more interesting operations. |
|
Jay Shah’s later articles contain examples that involve epilogue fusion. IMHO, understanding how to write an efficient transpose helps with following the more involved ones.