by rnrn
478 days ago
As I write this (after the updates to the evaluation code), https://pub.sakana.ai/ai-cuda-engineer/kernel/2/23/optimize-... is at the top of their list of speedups, with a claimed 128x speedup on a fused 3D convolution + groupnorm + mean. The generated implementation doesn't do a convolution at all. The second kernel on the leaderboard also appears to be incorrect. It contains a bunch of dead code that computes a convolution and never uses the result, then writes tanhf(1.0f) * scaling_factor to every output.