> They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.

I thought about that, but it wouldn't explain much: they are roughly two orders of magnitude off in terms of cost, and only a small fraction of that could be explained by the performance of the inference engine.

> The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.

What kind of optimizations do you have in mind? DeepSeek has only 37B active parameters, which means ~12GB at this level of quantization, so inference ought to be much faster than with a dense 70B model, especially an unquantized one, no? The Llama 70B distill would benefit from speculative decoding, but that shouldn't be enough to compensate. So I'm really curious which llama-specific optimizations you mean, and how much of a speedup you think they'd bring.
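To put rough numbers on that, here is a back-of-envelope sketch. It assumes single-stream decoding is memory-bandwidth-bound, so the time per token scales with the bytes of weights read per token. The ~12GB and 37B figures are the ones above; FP16 for the "unquantized" 70B and the 1 TB/s bandwidth are my own assumptions, and the function names are just for illustration. It ignores KV cache reads, attention compute, batching, and MoE routing overhead, so treat the result as an upper bound on the gap, not a benchmark.

```python
def weight_bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights that must be read to generate one token."""
    return active_params * bits_per_weight / 8


def decode_tokens_per_second(bytes_per_token: float, bandwidth_bytes_per_s: float) -> float:
    """Ceiling on decode speed if weight reads are the only cost (assumption)."""
    return bandwidth_bytes_per_s / bytes_per_token


# Figures taken from the comment above.
moe_active_params = 37e9
moe_bytes = 12e9  # ~12GB of active weights at this quantization level

# Bits per weight implied by those two figures.
implied_bits = moe_bytes * 8 / moe_active_params

# Assumption: "unquantized" dense 70B means 16-bit weights.
dense_bytes = weight_bytes_per_token(70e9, 16)

# Assumption: hypothetical 1 TB/s of memory bandwidth. The ratio below does not depend on it.
BANDWIDTH = 1e12

moe_tps = decode_tokens_per_second(moe_bytes, BANDWIDTH)
dense_tps = decode_tokens_per_second(dense_bytes, BANDWIDTH)
ratio = moe_tps / dense_tps

print(f"implied bits per weight: {implied_bits:.2f}")
print(f"dense 70B weights read per token: {dense_bytes / 1e9:.0f} GB")
print(f"bandwidth-bound speed ratio (MoE / dense): {ratio:.1f}x")
```

Under these assumptions the dense model has to read roughly an order of magnitude more bytes per token, which is why I'd expect speculative decoding alone not to close the gap.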