	by saagarjha 379 days ago
	Unfortunately the issue (alluded to in the blog post you linked) is that transposes do absolutely no work but memory loads. Sure, they test that you can swizzle your accesses, but modern accelerators are all about pipelining and feeding matrix multiply units, which is considerably harder than loading from memory as fast as possible. Actually, even the Mojo post barely beats CUDA for most of its kernels, because you can hit memory bandwidth for transpose on the latest hardware using techniques from 5-10 years ago. This is definitely not true for more interesting operations.
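
To make "hitting memory bandwidth" concrete: a transpose reads every element once and writes it once, so its minimum memory traffic is fixed by the matrix size alone, and the achieved bandwidth follows directly from the measured kernel time. A minimal sketch of that arithmetic is below. The function names, the float32 element size, and all timings and peak figures are hypothetical illustrations, not numbers from the linked posts.

```python
def transpose_bytes_moved(rows: int, cols: int, elem_bytes: int = 4) -> int:
    """Minimum traffic for a transpose: each element is read once and written once."""
    return 2 * rows * cols * elem_bytes


def effective_bandwidth_gbs(rows: int, cols: int, seconds: float,
                            elem_bytes: int = 4) -> float:
    """Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for a measured kernel time."""
    return transpose_bytes_moved(rows, cols, elem_bytes) / seconds / 1e9


def fraction_of_peak(rows: int, cols: int, seconds: float, peak_gbs: float,
                     elem_bytes: int = 4) -> float:
    """Share of the device's peak memory bandwidth that the kernel reached."""
    return effective_bandwidth_gbs(rows, cols, seconds, elem_bytes) / peak_gbs


if __name__ == "__main__":
    # Hypothetical measurement: 8192 x 8192 float32 matrix, 0.5 ms kernel time,
    # on a device with an assumed peak of 2000 GB/s.
    rows, cols, seconds, peak = 8192, 8192, 0.5e-3, 2000.0
    print(f"bytes moved:     {transpose_bytes_moved(rows, cols)}")
    print(f"achieved GB/s:   {effective_bandwidth_gbs(rows, cols, seconds):.1f}")
    print(f"fraction of peak: {fraction_of_peak(rows, cols, seconds, peak):.2f}")
```

Because there is no arithmetic to overlap with the loads and stores, this fraction is the only figure of merit for a transpose. That is why it says little about kernels that also have to keep matrix multiply units fed.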


musebox35 379 days ago

I totally agree that the resulting kernel will rarely be useful. I just wanted to highlight that it is a commonly used educational exercise for showing how to optimize for memory throughput. If the post showed how to fuse a transpose + RMSNorm epilogue into a GEMM, the kernel would be more functional, but the blog post would be much harder for newcomers to follow.

Jay Shah’s later articles contain examples that involve epilogue fusion. IMHO, understanding how to write an efficient transpose helps with following the more involved ones.


saagarjha 370 days ago

It's less that the result is kind of useless and more that hitting memory throughput on a simple algorithm like this is not very difficult. It takes a complex example to actually have trouble doing this.


simon_vtr 379 days ago

That was exactly my reason for writing this blog post and optimising transpose. It is a simple, educational, yet non-trivial example for learning the basics.
