Hacker News
by binarybana 6 days ago
I'm excited to see what cuTile-rs unlocks. I like the direction of Hugging Face's Grout project (https://github.com/huggingface/grout) for local LLM inference:

- state of the art performance

- codebase that fits in a context window (including kernel definitions!)

- single binary deployment

Similar to antirez's ds4.c, but in Rust and with cuTile making kernels both easier to author and higher performance.

2 comments

Hey! Eric here, one of the folks behind Grout (from HF). The small codebase was a deliberate goal: the whole engine, kernels included, is meant to be minimal and readable end to end, which is only practical because cuTile lets us write the kernels in Rust instead of in separate CUDA files. I think this makes Rust + CUDA development and rapid iteration super promising!

Yes! That's exactly the spirit. The readable, single-binary, kernels-included codebase is a big part of what makes Grout fun, and the antirez parallel is accurate.

There's a parallel to ThunderKittens too, on the kernel-authoring side: tile-based abstractions for writing fast kernels. The twist with cuTile Rust is the safety layer on top, which carries Rust's ownership model into the kernels: it's a safe, high-performance programming model, not just a perf DSL. The safe surface API is fairly domain-specific today (dense tensor/tile ops), and the Tile IR compiler is still maturing, but it's showing real promise for sparse and multi-GPU workloads. Excited to see where those go. :)
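
For anyone wondering what "carrying Rust's ownership model into the kernels" could look like, here is a rough CPU-only analogy in plain Rust. It is not the cuTile Rust API, and `TILE` and `add_tiled` are names made up for this sketch. The idea it shows is that the output is split into non-overlapping mutable tiles, so the borrow checker rules out two tiles writing the same elements.

```rust
// CPU-only analogy in plain Rust; this is NOT the cuTile Rust API.
// `TILE` and `add_tiled` are hypothetical names chosen for illustration.
const TILE: usize = 4;

/// Element-wise add, processed one tile at a time.
/// `chunks_mut` hands out non-overlapping `&mut` tiles of `out`, so the
/// borrow checker guarantees that no two tiles alias the same output elements.
fn add_tiled(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    for ((out_tile, a_tile), b_tile) in out
        .chunks_mut(TILE)
        .zip(a.chunks(TILE))
        .zip(b.chunks(TILE))
    {
        // Each iteration plays the role of one tile "program": shared read
        // access to the inputs, exclusive write access to its own output tile.
        for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
            *o = x + y;
        }
    }
}
```

Calling `add_tiled(&a, &b, &mut out)` with three equal-length slices fills `out` tile by tile; the last tile is simply shorter when the length is not a multiple of `TILE`. A real GPU kernel would run tiles in parallel, which is where the exclusive-write guarantee starts to matter.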