by simon_vtr
375 days ago
The CUDA kernels I mention use logic equivalent to the Mojo kernels. You can find them on my GitHub: https://github.com/simveit/effective_transpose
If you have a faster kernel for H100, feel free to open a PR. I will merge it once I have checked that it is actually faster.