by deyiao 480 days ago
I heard their inference framework is way lower than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vLLM or llama.cpp?
4 comments

By "lower" you mean cheaper/better?

I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.

I think they meant lower level.
It seems hard to guess. Could be lower level, lower performance, or lower compute cost.
What do you mean by "lower"? To my understanding, they will open 5 infra-related repos this week. Let's revisit your comparison question on Friday.
I don't see any use of PTX; it might be in one of the other repos they plan to release.
Right, I think PTX use is a bigger deal than it's getting coverage for. This creates an opening for other vendors to get their foot in the door with PTX-to-LLVM-IR translation for existing CUDA kernels.
Maybe. Apple ditched them in China because their infra can't handle users at large scale.
I don't think the decision was based on infra or any other technical reason. It's more about service support. How would a 200-person company support 44M iPhone users in China?
Is that true? I thought Apple was going to use their own infrastructure.
DeepSeek doesn't have any experience supporting a 50-million-user base. That was the reason Apple cited a few weeks ago.