by popinman322
434 days ago
You can swap experts in and out of VRAM; it just increases inference time substantially. Depending on the routing function, you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
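
To make the pipelining idea concrete, here is a rough toy sketch, not a real inference stack. It assumes the routing for every layer is already known before the pass (which only holds for some routing functions), and the names `ExpertCache`, `load_layer`, `run_token`, `load_fn`, and `compute_fn` are all hypothetical. "VRAM" is simulated with a small LRU cache, and a single background thread loads the next layer's experts while the current layer computes.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class ExpertCache:
    """Toy stand-in for VRAM: holds at most `capacity` experts, evicts LRU."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn         # hypothetical host-to-device copy
        self.resident = OrderedDict()  # (layer, expert) -> weights
        self.loads = 0                 # number of simulated transfers

    def fetch(self, key):
        if key in self.resident:
            self.resident.move_to_end(key)  # mark as recently used
            return self.resident[key]
        weights = self.load_fn(key)
        self.loads += 1
        self.resident[key] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        return weights


def load_layer(cache, layer, expert_ids):
    """Bring every expert routed to in `layer` into the cache."""
    return [cache.fetch((layer, e)) for e in expert_ids]


def run_token(routes, cache, compute_fn):
    """routes[i] lists the expert ids for layer i, assumed known up front."""
    if not routes:
        return []
    outputs = []
    # One worker, so the cache is only ever touched from a single thread.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, cache, 0, routes[0])
        for layer in range(len(routes)):
            weights = pending.result()  # wait for this layer's experts
            if layer + 1 < len(routes):
                # Start loading the next layer before computing this one.
                pending = pool.submit(
                    load_layer, cache, layer + 1, routes[layer + 1]
                )
            outputs.append(compute_fn(layer, weights))
    return outputs
```

In a real setup the cache would need room for the experts of two adjacent layers at once, otherwise the prefetch could evict weights that are still in use; the toy version sidesteps that because Python keeps the evicted objects alive while they are referenced.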