by zhihaojia
362 days ago
Thanks for the great feedback! Stanford's MegaKernel project tackles a similar challenge but focuses on manual CUDA implementation. MPK instead takes a compiler-driven approach: users express their LLMs at the PyTorch level, and MPK automatically compiles them into optimized megakernels. Our goal is to make programming megakernels much more accessible.

I completely agree that CUDA can be a limiting factor, especially for latency-sensitive workloads. As GPUs become larger and faster, it's increasingly difficult to write standalone kernels that fully utilize hardware resources, particularly when optimizing for low latency with small batch sizes.

> What are the chances we see your work land in PyTorch as an experimental backend?

We're definitely excited about that direction. We believe MPK can help PyTorch support megakernel generation, and we're actively exploring how to make that happen. Stay tuned!

> P.S. minor typo, your first two paragraphs under part 1 are nearly identical.

Thanks for pointing it out. I meant to remove the duplicate paragraph when finalizing the post.
|
Thank you!