| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by zozbot234 35 days ago
	KV cache size is the main constraint on batching (for any given ctx length), that's a huge deal for efficiency both locally and in the data center. DeepSeek V4's reduced KV requirement is a real game changer, it definitively unlocks batching requests together for local inference, not just at scale.

1 comments

anonym29 35 days ago

This may be relevant for parallelizable workloads. For reference on my perspective: I come at this as someone who is exclusively concerned with sequential, non-parallelizable, single-user, single-system workloads.

link

zozbot234 35 days ago

If you have multiple chats going at the same time in your LLM web interface, that's already a parallelizable workload wrt. batched inference. And this broadly describes the more sophisticated users of LLMs (who are using it for more than just casual chit-chat), especially wrt. the largest "pro" models. Parallelism is also quite applicable to agentic workloads.

link