|
|
|
|
|
by jsnell
373 days ago
|
|
The "Switching to Mojo gave a 14% improvement over CUDA" title is editorialized, the original is "Highly efficient matrix transpose in Mojo". Also, the improvement is 0.14%, not 14% making the editorialized linkbait particularly egregious. |
|
transpose_naive - Basic implementation with TMA transfers
transpose_swizzle - Adds swizzling optimization for better memory access patterns
transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:
transpose_naive: 1056.08 GB/s (32.0025% of max)
transpose_swizzle: 1437.55 GB/s (43.5622% of max)
transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
via the GitHub - simveit/efficient_transpose_mojo
Comparing to the CUDA implementations mentioned in the article:
Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s
Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s
So there is highly efficient matrix transpose in Mojo
All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.