by jaberjaber23 153 days ago

Six months ago, if you had asked me whether an LLM could write a CUDA kernel that actually beats PyTorch's compiler, I would have said no. The optimization space is too complex. Too many hardware details. Too easy to write something that compiles but runs slower than the baseline.

I was wrong!!

We're now seeing multi-agent systems that take your PyTorch code and spit out CUDA or Triton kernels with 2x to 14x speedups over torch.compile(mode='max-autotune-no-cudagraphs'). Not on toy benchmarks. On real models like Llama-3.1-8B, Whisper, and Stable Diffusion.
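
For anyone who wants to sanity-check a speedup number like that, here is a rough sketch of how a "Nx over baseline" figure is usually computed. This is not the harness any of these systems use; the helper names `median_time` and `speedup` are made up for illustration, and it only uses the Python standard library so it runs anywhere.

```python
import statistics
import time
from typing import Callable


def median_time(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock seconds per call of fn, measured after warmup calls.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    for _ in range(warmup):
        fn()  # warmup absorbs one-time costs such as compilation or autotuning
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Baseline time divided by candidate time; above 1.0 means the candidate is faster."""
    if baseline_s <= 0 or candidate_s <= 0:
        raise ValueError("timings must be positive")
    return baseline_s / candidate_s
```

You would pass the torch.compile'd model call as the baseline and the generated kernel call as the candidate. On a GPU, each `fn` has to block until the device is done (for example by calling `torch.cuda.synchronize()` inside it), otherwise you are only timing the kernel launch.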