Hacker News new | ask | show | jobs
by jbellis 7 days ago
it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.

- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out - persistent engine kernel: this is like CUDA 101 - warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now - MXFP4 QAT: not new - TileRT: hard to tell what this actually does, there's a PyPi wheel with support for DS 3.2 and GLM 5 but binary only

1 comments

tilert is a highly optimized megakernel, its a single kernel that does the entire decode pass, this enables overlapping weight loading with computation, eliminates cuda launch overhead (CUDA graph does not, contrary to what most people think), allows for more fine-grained pipelining. There're lots of blogs/papers on it. Its currently the best approach to maximize memory bandwidth. But megakernels are incredibly hard to optimize, and only work for small batch sizes (low throughput, hence high price), thats why we don't see them much in production.