Hacker News
by zander_jiang 15 days ago
TileRT is a highly optimized megakernel: a single kernel that runs the entire decode pass. This lets it overlap weight loading with computation, eliminates CUDA kernel launch overhead (CUDA graphs do not, contrary to what most people think), and allows more fine-grained pipelining. There are lots of blogs and papers on it. It's currently the best approach for maximizing memory bandwidth utilization. But megakernels are incredibly hard to optimize and only work for small batch sizes (low throughput, hence high price), which is why we don't see them much in production.
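
To put rough numbers on the launch-overhead point, here is a back-of-envelope sketch. It assumes small-batch decode is memory-bound (every weight byte is read once per step) and that each kernel boundary costs a fixed gap during which no weights are streaming. The function name and every constant are made up for illustration; none of them are TileRT measurements, and the model ignores the extra overlap a real megakernel gets from pipelining.

```python
def decode_step_us(weight_bytes, bandwidth_bytes_per_s, num_kernels, gap_us):
    """Rough time for one decode step, in microseconds (hypothetical model)."""
    # Memory-bound floor: time to stream all weights once at peak bandwidth.
    stream_us = weight_bytes / bandwidth_bytes_per_s * 1e6
    # Each kernel boundary adds an idle bubble on top of the streaming time.
    return stream_us + num_kernels * gap_us


# Illustrative assumptions only, not measurements.
WEIGHT_BYTES = 16e9      # e.g. 8B parameters at 2 bytes each
BANDWIDTH = 2e12         # 2 TB/s of memory bandwidth
KERNELS_PER_STEP = 300   # assumed kernel count for a multi-kernel decode step
GAP_US = 5.0             # assumed idle gap per kernel boundary

floor_us = decode_step_us(WEIGHT_BYTES, BANDWIDTH, 0, GAP_US)
multi = decode_step_us(WEIGHT_BYTES, BANDWIDTH, KERNELS_PER_STEP, GAP_US)
mega = decode_step_us(WEIGHT_BYTES, BANDWIDTH, 1, GAP_US)

print(f"multi-kernel: {multi:.0f} us, bandwidth utilization {floor_us / multi:.1%}")
print(f"megakernel:   {mega:.0f} us, bandwidth utilization {floor_us / mega:.1%}")
```

Under these assumed numbers the kernel-boundary gaps cost a double-digit percentage of the step. The gap term does not shrink as the model gets faster to stream, so the smaller the per-step streaming time, the more the boundaries hurt.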