| HN Mirror



	by zozbot234 54 days ago
	Qwen 27B's KV cache maxes out at 16GB at full context. A nice thing about DeepSeek V4, especially Flash, is that its KV cache stays tiny even at 1M tokens, which in turn opens up wide batching on common consumer platforms.

1 comments

lostmsu 54 days ago

DeepSeek V4 Flash is 160GB while Qwen 27B is about 27GB. You can't even run DS Flash on consumer platforms, let alone batch it.


zozbot234 53 days ago

These are the sizes of model weights, not the KV cache. The former are a sparse (for MoE models) read workload that can be streamed from SSD.
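To make the weights-versus-KV-cache distinction concrete, here is a rough sketch of the usual KV cache size estimate for a transformer with standard grouped-query attention. The function name and every number below (layer count, KV heads, head dimension, context length) are made-up illustrative values, not the real configuration of Qwen 27B or DeepSeek V4, whose attention design may differ and shrink the cache further.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size for standard (grouped-query) attention.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2. bytes_per_elem=2 assumes fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical dense model, NOT a real config: 64 layers, 8 KV heads,
# head_dim 128, 128K-token context, fp16 cache.
size = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=131072)
print(f"{size / 2**30:.1f} GiB per sequence")
```

The cache grows linearly with both context length and batch size, while weight size is fixed, which is why a small per-token cache matters for batching even when the weights themselves are large.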


lostmsu 53 days ago

You can't batch MoE.


zozbot234 53 days ago

You need wider batches to get effective reuse of experts in any given layer, but you absolutely can. DeepSeek V4 has tiny KV caches that make this quite feasible. When targeting consumer platforms that only have a limited amount of compute headroom to begin with, the approach is quite reasonable.


lostmsu 53 days ago

Sounds like you're talking out of your butt instead of doing the math.


zozbot234 53 days ago

What do you mean by doing the math? If you repeatedly sample n_active experts out of n_total, why wouldn't you expect a meaningful probability of reuse/overlap once your batch grows past size 5 or so (for the sparsest MoE models in common use)? And you only need enough reuse to fill the compute headroom, which is quite small on consumer platforms (we won't have huge TOPS numbers for the typical integrated GPU in Strix Halo or even the upcoming RTX Spark). Plus, if you're a single user running multiple streams in parallel, the choice of experts will be highly biased, leading to more reuse.
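The sampling argument above can be put into numbers with a back-of-the-envelope sketch. It assumes each token in the batch independently picks n_active distinct experts uniformly at random within a layer, which is a simplification (real routers are biased, which only increases overlap). The function names and the 256-total / 8-active figures are hypothetical, chosen for illustration rather than taken from any specific model.

```python
def expected_distinct_experts(n_total, n_active, batch):
    """Expected number of distinct experts touched in one layer by a batch.

    Under uniform independent routing, a given expert is missed by one token
    with probability (1 - n_active / n_total), so it is missed by the whole
    batch with that probability raised to the batch size.
    """
    p_untouched = (1 - n_active / n_total) ** batch
    return n_total * (1 - p_untouched)


def expected_reuse_fraction(n_total, n_active, batch):
    """Fraction of expert activations served by an already-loaded expert."""
    activations = n_active * batch
    loads = expected_distinct_experts(n_total, n_active, batch)
    return 1 - loads / activations


# Hypothetical sparse MoE layer: 256 experts, 8 active per token.
for batch in (1, 5, 16, 64):
    distinct = expected_distinct_experts(256, 8, batch)
    reuse = expected_reuse_fraction(256, 8, batch)
    print(f"batch={batch:3d}  distinct experts={distinct:6.1f}  reuse={reuse:.1%}")
```

Under these assumptions the reuse fraction is zero at batch size 1 and rises steadily with batch width. How much reuse is "enough" depends on the platform's compute headroom, which this sketch does not model.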
