by simon_vtr 375 days ago
The CUDA kernels I mention use logic equivalent to the Mojo kernels. You can find them on my GitHub: https://github.com/simveit/effective_transpose. If you have a faster kernel for the H100, feel free to submit it via PR, and I will merge it after checking that it's faster.