
The lack of understanding was obvious from the start. They didn't benchmark CPU-only code against a GPU port of the same code, which would have been a fair comparison, since a lot of CPU code benefits from being ported to CUDA.

They dishonestly thought you could generate GPU code that is faster than what CUDA experts have written by hand using extensive hardware knowledge. That kind of intuition is unlikely to be in the training data, so it is unlikely to inform the generated kernel. The very thing they are attempting goes beyond what current-generation LLMs can do.

The stated goal is also very silly. People don't need help running PyTorch on CUDA. One of the most important fused kernels in machine learning is FlashAttention, and the reason it can fuse operations is that it is actually a very different algorithm from conventional attention. It reorders the operations (tiling plus an online softmax), and that reordering is what lets you fuse them. The result is mathematically the same as conventional attention, but not bit-for-bit identical, because the floating-point operations happen in a different order.
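
To make the reordering point concrete, here is a rough sketch of the online-softmax trick for a single query, in plain Python. This is not the real FlashAttention kernel (which tiles queries too, lives in on-chip memory and applies the 1/sqrt(d) scaling). The function names `naive_attention` and `streaming_attention` and the `block` parameter are made up for illustration. The only point is that the streaming version never materializes the full score vector, which is what makes fusion possible.

```python
import math

def naive_attention(q, keys, values):
    """Conventional attention for one query: compute all scores, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def streaming_attention(q, keys, values, block=2):
    """Online-softmax sketch: visit keys/values block by block, keeping only
    a running max, a running denominator, and an un-normalized output."""
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max.
        # On the first block this is exp(-inf) == 0.0, which is harmless
        # because denom and acc are still zero.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]
```

Calling both functions with the same `q`, `keys` and `values` should give outputs that agree up to floating-point rounding, for any `block` size. The streaming version gets there with different intermediate steps, and that is an algorithmic change rather than a mechanical fusion of the original ops.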