by rao-v
115 days ago
So llama.cpp currently places overflow MoE experts in RAM statically and runs them on the CPU, so you get a mix of GPU + CPU inference. Those experts are rooflined by RAM->CPU bandwidth and CPU compute. If MoE routing is predictable enough, you might see a world where it's more efficient to spend PCIe bandwidth (slower than RAM->CPU) on loading the MoE experts for the next ~3 layers from RAM into VRAM, so you are not rooflined by CPU compute. vLLM / SGLang (AFAIK) just assume you have enough VRAM to fit all the experts (but will page KV cache to RAM).
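To make the tradeoff concrete, here is a rough back-of-envelope sketch in Python. Everything in it is an assumption for illustration, not a measurement: the bandwidth and FLOP/s figures, the 4-bit expert size, the function names, and the simplification that a correctly predicted expert's PCIe transfer overlaps with other work while a misprediction falls back to the CPU path. It ignores PCIe traffic wasted on wrong guesses.

```python
# All hardware numbers below are illustrative assumptions, not benchmarks.
RAM_BW = 80e9       # bytes/s, RAM -> CPU
PCIE_BW = 25e9      # bytes/s, RAM -> VRAM over PCIe
CPU_FLOPS = 1e12    # sustained CPU FLOP/s
GPU_FLOPS = 1e14    # sustained GPU FLOP/s
EXPERT_BYTES = 50e6 # one expert's weights, assumed ~4-bit quantized


def expert_flops(expert_bytes, tokens, bytes_per_param=0.5):
    """Approximate matmul FLOPs: 2 per parameter per token routed to the expert."""
    params = expert_bytes / bytes_per_param
    return 2 * params * tokens


def cpu_expert_time(expert_bytes, tokens, ram_bw=RAM_BW, cpu_flops=CPU_FLOPS):
    """Roofline estimate for running the expert on the CPU:
    bounded by the slower of reading the weights and doing the math."""
    flops = expert_flops(expert_bytes, tokens)
    return max(expert_bytes / ram_bw, flops / cpu_flops)


def prefetch_expert_time(expert_bytes, tokens, hit_rate,
                         pcie_bw=PCIE_BW, gpu_flops=GPU_FLOPS):
    """Expected time when experts are predicted and copied to VRAM ahead of use.
    Hit: bounded by the slower of the PCIe copy and GPU compute.
    Miss: assumed to fall back to the CPU path."""
    flops = expert_flops(expert_bytes, tokens)
    hit_time = max(expert_bytes / pcie_bw, flops / gpu_flops)
    miss_time = cpu_expert_time(expert_bytes, tokens)
    return hit_rate * hit_time + (1 - hit_rate) * miss_time


if __name__ == "__main__":
    for tokens in (1, 8, 64):
        cpu = cpu_expert_time(EXPERT_BYTES, tokens)
        pre = prefetch_expert_time(EXPERT_BYTES, tokens, hit_rate=0.9)
        winner = "prefetch" if pre < cpu else "cpu"
        print(f"tokens={tokens:3d} cpu={cpu * 1e3:.3f} ms "
              f"prefetch={pre * 1e3:.3f} ms -> {winner}")
```

Under these made-up numbers, single-token decode stays bandwidth-bound and the CPU path wins, since RAM->CPU is faster than PCIe. Prefetching only pays off once enough tokens hit the same expert (batching or prompt processing) that CPU compute becomes the ceiling, and only if the prediction hit rate is high.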