This is the writer of the blog post. You are right that Stanford's work is a parallel effort. The main difference is that our focus is on compilation: making it easier to generate megakernels automatically.
Ooops, missed one sentence in my previous response. Stanford's MegaKernel project tackles a similar challenge but focuses on manual CUDA implementation. While MPK takes a compiler-driven approach—users express their LLMs at the PyTorch level, and MPK automatically compiles them into optimized megakernels. Our goal is to make programming megakernels much more accessible.