HN Mirror



	by reissbaker 480 days ago
	By "lower" you mean cheaper/better? I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third-party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version, whereas most people, including DeepSeek themselves, serve in FP8, so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.

1 comment

I think they meant lower level.

It seems hard to guess. Could be lower level, lower performance, or lower compute cost.