| [op here] To be clear: yes, there are 3 kernels. You can see them in the GitHub repo linked at the end of the article. They are: transpose_naive, a basic implementation with TMA transfers; transpose_swizzle, which adds swizzling for better memory access patterns; and transpose_swizzle_batched, which adds thread coarsening (batch processing) on top of swizzling. The Mojo implementations achieve the following bandwidths (figures from the GitHub repo simveit/efficient_transpose_mojo): transpose_naive: 1056.08 GB/s (32.0025% of max); transpose_swizzle: 1437.55 GB/s (43.5622% of max); transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max). Compared to the CUDA implementations mentioned in the article: naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s; swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s; batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s. So there is a highly efficient matrix transpose in Mojo. All three Mojo kernels outperform their CUDA counterparts: the naive and swizzle kernels are significantly faster (20.6% and 14.8% respectively), while the final optimized kernel is essentially identical (ahead by 4.14 GB/s). The flag here seemed inappropriate, given that it's true this implementation is indeed faster, and the final iteration could certainly be improved further. It wasn't wrong to say 14% or even 20%. |
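For anyone who wants to check the percentages, here is a quick sketch that recomputes them from the bandwidth figures quoted above. The dictionary layout and the `speedup_percent` helper are just illustrative names, not something from the article or the repo.

```python
# Bandwidths in GB/s as quoted in the comment: (Mojo, CUDA) per kernel.
results = {
    "naive": (1056.08, 875.46),
    "swizzle": (1437.55, 1251.76),
    "swizzle_batched": (2775.49, 2771.35),
}


def speedup_percent(mojo: float, cuda: float) -> float:
    """Relative bandwidth gain of Mojo over CUDA, in percent (hypothetical helper)."""
    return (mojo / cuda - 1.0) * 100.0


for name, (mojo, cuda) in results.items():
    # Print both the relative gain and the absolute difference in GB/s.
    print(f"{name}: {speedup_percent(mojo, cuda):.1f}% faster ({mojo - cuda:+.2f} GB/s)")
```

This only compares the reported numbers; it says nothing about how they were measured or whether both sets came from the same hardware and problem size.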
Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.