by jaberjaber23
153 days ago
Six months ago, if you had asked me whether an LLM could write a CUDA kernel that actually beats PyTorch's compiler, I would have said no. The optimization space is too complex. Too many hardware details. Too easy to write something that compiles but runs slower than the baseline. I was wrong. We're now seeing multi-agent systems that take your PyTorch code and spit out CUDA or Triton kernels with 2x to 14x speedups over torch.compile(mode='max-autotune-no-cudagraphs'). Not on toy benchmarks. On real models like Llama-3.1-8B, Whisper, and Stable Diffusion.