|
|
|
|
|
by bigyabai
177 days ago
|
|
You don't need a REAP-processed model to offload on a per-expert basis. All MoE models are inherently sparse, so you're only operating on a subset of activated layers when the prompt is being processed. It's more of a PCI bottleneck than a CPU one. > And I don’t consider running a 1-bit superquant to be a valid thing here either. I don't either. MXFP4 is scalar. |
|
You're better off prioritizing the offload of the KV cache and attention layers to the GPU than trying to offload a specific expert or two, but the performance loss I was talking about earlier still means you're not offloading enough for a 96GB GPU to make things how they need to be. You need multiple, or you need a Mac Studio.
If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.