by zozbot234
8 days ago
You need wider batches to get effective reuse of experts in any given layer, but that is absolutely achievable. DeepSeek V4 has tiny KV caches, so keeping many sequences in memory at once is quite feasible. When targeting consumer platforms that have only limited compute headroom to begin with, the approach is quite reasonable.
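To put rough numbers on the batch-width point, here is a back-of-the-envelope sketch. It assumes routing is uniform and independent across tokens, which real routers are not, and the 256-expert, top-8 configuration is purely illustrative rather than any particular model's actual layout. The function names are made up for this example.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of experts touched in one layer by a batch of tokens.

    Assumption: each token picks top_k distinct experts uniformly at random,
    independently of the other tokens.
    """
    # Chance that one specific expert is skipped by every token in the batch.
    p_untouched = (1 - top_k / num_experts) ** batch_tokens
    return num_experts * (1 - p_untouched)


def tokens_per_loaded_expert(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Average number of token activations served per expert that must be loaded."""
    total_activations = batch_tokens * top_k
    return total_activations / expected_distinct_experts(num_experts, top_k, batch_tokens)


if __name__ == "__main__":
    # Illustrative configuration only: 256 experts per layer, top-8 routing.
    for batch in (1, 8, 64, 512):
        reuse = tokens_per_loaded_expert(256, 8, batch)
        print(f"batch={batch:4d}  activations per loaded expert ~ {reuse:.2f}")
```

With a single token every loaded expert is used exactly once, so there is no reuse at all. Under these assumptions the reuse factor only climbs once the batch is wide enough that most experts in the layer are being hit anyway.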