| HN Mirror



	by zozbot234 41 days ago
	Keep in mind, I said serving many requests in parallel, not just many users. In fact, it's even more efficient if you can batch the requests of a large subagent swarm in parallel, since this allows for sharing a big chunk of context/KV cache, not just the model weights. That's why I raised the possibility of leveraging this same efficiency with DeepSeek V4. If as a user I can get into the habit of just firing off a request to be cranked on in the background and completed whenever, and I reach a compute-limited workload (just like the big inference labs that serve many users concurrently, only on a smaller scale since the overall compute bottleneck hits sooner), that's quite new wrt. local models. It used to be that we could only do that by spending huge amounts of money on very fast RAM and/or scaling out to multiple nodes. A big cloud vendor does not face the same opportunity: they cannot leverage the repurposing of your own existing hardware. And they'll definitely want to minimize latency in order to get maximum throughput/utilization from the hardware they did buy, even at an energy cost. That's why I was careful to note latency as a possible factor before.
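
To put a rough number on the "sharing a big chunk of context/KV cache" point, here is a minimal sketch that compares KV cache memory for a subagent swarm with and without a shared prompt prefix. The function name and every figure in it (agent count, token counts, bytes per token) are hypothetical placeholders chosen for illustration, not measurements of DeepSeek V4 or any other model.

```python
def kv_cache_bytes(n_agents, shared_tokens, private_tokens,
                   bytes_per_token, share_prefix):
    """Total KV cache bytes for n_agents that all start from the same prefix.

    With share_prefix=True the common prefix is stored once and reused by
    every agent; otherwise each agent keeps its own copy of it.
    """
    prefix_copies = 1 if share_prefix else n_agents
    prefix = prefix_copies * shared_tokens * bytes_per_token
    private = n_agents * private_tokens * bytes_per_token
    return prefix + private


if __name__ == "__main__":
    # Hypothetical swarm: 32 agents, a 100k-token shared context,
    # 4k tokens of private context each, 10 kB of KV cache per token.
    args = dict(n_agents=32, shared_tokens=100_000,
                private_tokens=4_000, bytes_per_token=10_000)
    unshared = kv_cache_bytes(share_prefix=False, **args)
    shared = kv_cache_bytes(share_prefix=True, **args)
    print(f"unshared: {unshared / 1e9:.2f} GB")
    print(f"shared:   {shared / 1e9:.2f} GB")
    print(f"ratio:    {unshared / shared:.1f}x")
```

The saving grows with the number of agents and with how much of each agent's context is common, which is why a swarm working from one large shared context is the favorable case here.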

1 comments

gghh 41 days ago

Ah ok, sharing context/KV cache, I can see that helping. I need to learn more about DS V4; you seem to hint it has some advantages over previous generations in this respect. I haven't followed that closely enough to quite catch this argument, so I'll check it out.


zozbot234 41 days ago

The basic argument is that its KV cache is roughly an order of magnitude more compact than that of previous Chinese models, which were already very compact compared to the likes of Gemma 4 (though that example is a bit of an extreme). If you pair this with the basic facts of how to maximize LLM inference performance at scale (this was recently talked about in a video lecture on the Dwarkesh Patel YouTube podcast), the case for doing slow batched inference on prem with DeepSeek V4, perhaps even with memory offload, becomes, as I see it, quite obvious. Of course, I'd like to be proven wrong!
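
One way to see why a compact KV cache matters for slow batched inference is a simple roofline-style estimate of a single decode step: the step takes as long as the slower of reading memory (weights once, plus each request's KV cache) and doing the arithmetic. The sketch below is a simplification under stated assumptions; all hardware and model figures are made-up placeholders, and it ignores effects such as MoE expert routing, offload transfers, and prefix sharing.

```python
def decode_step(batch, weight_bytes, kv_bytes_per_request,
                flops_per_token, mem_bandwidth, peak_flops):
    """Return (seconds per decode step, True if the step is compute-bound)."""
    # Weights are read once per step; every request's KV cache is read too.
    memory_time = (weight_bytes + batch * kv_bytes_per_request) / mem_bandwidth
    # Each request in the batch produces one token per step.
    compute_time = batch * flops_per_token / peak_flops
    return max(memory_time, compute_time), compute_time >= memory_time


def tokens_per_second(batch, **params):
    seconds, _ = decode_step(batch, **params)
    return batch / seconds


# Hypothetical machine and model, for illustration only.
BASE = dict(weight_bytes=20e9, flops_per_token=40e9,
            mem_bandwidth=200e9, peak_flops=50e12)
BULKY_KV = dict(BASE, kv_bytes_per_request=0.5e9)
COMPACT_KV = dict(BASE, kv_bytes_per_request=0.05e9)  # 10x smaller cache

if __name__ == "__main__":
    for batch in (1, 16, 64, 256):
        for label, params in (("bulky", BULKY_KV), ("compact", COMPACT_KV)):
            _, compute_bound = decode_step(batch, **params)
            rate = tokens_per_second(batch, **params)
            print(f"batch={batch:4d} kv={label:8s} "
                  f"tok/s={rate:8.1f} compute_bound={compute_bound}")
```

With these placeholder numbers the bulky cache never lets the step become compute-bound, because its KV reads per request cost more time than the arithmetic per request, while the 10x smaller cache crosses over at a large enough batch. That crossover is the "compute-limited" regime the comments above refer to.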


gghh 41 days ago

Right, Dwarkesh's episode with Reiner Pope. Didn't watch the full video, but as soon as I saw them both going to an old-school blackboard with actual chalk in hand I could tell they meant business hehe :) Thanks for recommending the vid and for the info about DS V4.
