by stephantul
483 days ago
This was interesting to see happen live on X. I think putting this on the LLM is a bit generous. Their results were apparently 30x above the theoretical maximum, according to GPU master Tri Dao, so there was also a lack of understanding of what is possible with CUDA.

See: https://x.com/tri_dao/status/1892610951662153945

Also see this thread by Lucas Beyer: https://x.com/giffmana/status/1892510741242036468

One of the greatest skills in research is to remain skeptical of one's own results, especially when they are exceptional. They chose to pull the trigger and release too quickly. This can happen in any setting, not just codegen, e.g. inadvertently training on the test set.

Science is a slow ascent: if it looks too good to be true, you probably just have a bug.
They dishonestly assumed that an LLM can produce GPU code faster than what CUDA experts have written using extensive hardware knowledge. The intuition behind that knowledge is unlikely to be in the training data, let alone used when generating the kernel. The very thing they are attempting goes beyond what current-generation LLMs can do.
The stated goal is also very silly. People don't need help running PyTorch on CUDA. One of the most important fused kernels in machine learning is FlashAttention. The reason it can fuse operations is that it is actually a very different algorithm from conventional attention: it reorders the operations, which is what lets you fuse them, and its result matches conventional attention only up to floating-point rounding, not bit for bit.
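To make the reordering point concrete, here is a minimal sketch in plain Python of the running-max, running-sum softmax trick that this kind of fusion relies on. It is not the actual FlashAttention kernel: there is no GPU tiling, no batching, and the usual 1/sqrt(d) scaling is left out for brevity. The function names and the toy data are made up for illustration. It only shows that attention for a single query can be computed one block of keys at a time, without ever holding the full row of scores.

```python
import math

def naive_attention(q, keys, values):
    """Conventional attention for one query: all scores, full softmax, weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def streaming_attention(q, keys, values, block_size=2):
    """Block-by-block attention using a running max and a running sum.

    Illustrative only: a sketch of the rescaling idea, not a real kernel.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data, chosen arbitrarily for the example.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.5, 0.5], [-0.7, 0.9], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 1.0]]

reference = naive_attention(q, keys, values)
streamed = streaming_attention(q, keys, values, block_size=2)
```

The two outputs agree to within floating-point tolerance but are not guaranteed to be bit-identical, because the sums are accumulated in a different order. Writing that reordering as an efficient GPU kernel is the hard part, and it is the part that takes hardware knowledge.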