Hacker News
by saagarjha 373 days ago
> This kernel achieves a bandwidth of 1056.08 GB/s which is faster than the 875.46 GB/s we achieved using CUDA. I believe the reason is that we use the PTX API for TMA transfers in Mojo.

I can't say for sure because I couldn't find the CUDA kernel, but I kind of doubt this is true. You can hit memory bandwidth on Hopper without using TMA at all; TMA is mostly designed for accelerating asynchronous copies and reducing memory pressure. If all you are doing is a transpose, you don't need any of this to go fast (though it might simplify your indexing code…?)
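
For illustration, here is a minimal sketch of the kind of transpose that uses no TMA: a plain shared-memory tiled kernel with a padded tile. This is not the kernel from the post or from the linked repo. The names `transpose_tiled`, `TILE`, and `BLOCK_ROWS` and the matrix size are assumptions made up for the example, error checking is omitted, and the single timed launch gives only a rough number. Whether it reaches the figures quoted above on an H100 is something you would have to measure.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;       // tile edge in elements (assumed value)
constexpr int BLOCK_ROWS = 8;  // each thread copies TILE / BLOCK_ROWS elements

// Transposes a rows x cols row-major matrix into a cols x rows row-major matrix.
__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out, int rows, int cols) {
    // The +1 padding keeps column reads from hitting the same shared memory bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    for (int j = 0; j < TILE; j += BLOCK_ROWS) {
        if (x < cols && (y + j) < rows)
            tile[threadIdx.y + j][threadIdx.x] = in[(size_t)(y + j) * cols + x];
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;  // output column (an input row)
    y = blockIdx.x * TILE + threadIdx.y;  // output row (an input column)
    for (int j = 0; j < TILE; j += BLOCK_ROWS) {
        if (x < rows && (y + j) < cols)
            out[(size_t)(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
    }
}

int main() {
    const int rows = 4096, cols = 2048;  // arbitrary test size
    const size_t n = (size_t)rows * cols;
    std::vector<float> h_in(n), h_out(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, BLOCK_ROWS);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);  // warm-up launch
    cudaEventRecord(start);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Check every element against the definition out[c][r] == in[r][c].
    bool ok = true;
    for (int r = 0; r < rows && ok; ++r)
        for (int c = 0; c < cols; ++c)
            if (h_out[(size_t)c * rows + r] != h_in[(size_t)r * cols + c]) {
                ok = false;
                break;
            }

    // Effective bandwidth counts one read and one write of the whole matrix.
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("correct: %s, effective bandwidth: %.2f GB/s\n", ok ? "yes" : "no", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Both the global load and the global store are coalesced here because the tile is read row-wise and written row-wise, with the actual transposition happening in shared memory. Something like `nvcc -O3 transpose.cu` should build it (the file name is hypothetical).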

1 comment

The CUDA kernels I mention use the same logic as the Mojo kernels. You can find them on my GitHub: https://github.com/simveit/effective_transpose If you have a faster kernel for H100, feel free to open a PR and I will merge it after checking that it’s faster.