by seydor
929 days ago
Looks like they're too busy being awesome. I need a fake video to understand this!
What memory will this need? I guess it won't run on my 12GB of VRAM:
"moe": {"num_experts_per_tok": 2, "num_experts": 8}
I bet many people will re-discover BitTorrent tonight.
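A rough way to answer the memory question is back-of-the-envelope arithmetic on the config above. The sketch below is only an estimate under loud assumptions: it treats every expert as a full ~7B-parameter model with no shared weights (a naive upper bound; the actual split between shared and per-expert weights isn't given in this thread), and it counts weights only, ignoring KV cache and activations. The function names are made up for illustration.

```python
def moe_weight_memory_gb(shared_params, params_per_expert, num_experts, bytes_per_param):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    total_params = shared_params + params_per_expert * num_experts
    return total_params * bytes_per_param / 1e9


def active_params_per_token(shared_params, params_per_expert, num_experts_per_tok):
    """Parameters actually touched when generating one token."""
    return shared_params + params_per_expert * num_experts_per_tok


# ASSUMPTION: 8 experts of ~7B each, nothing shared (naive upper bound).
NUM_EXPERTS = 8            # "num_experts" from the config
NUM_EXPERTS_PER_TOK = 2    # "num_experts_per_tok" from the config
PARAMS_PER_EXPERT = 7e9    # assumed, not stated in the config

fp16_gb = moe_weight_memory_gb(0, PARAMS_PER_EXPERT, NUM_EXPERTS, 2.0)   # 16-bit weights
q4_gb = moe_weight_memory_gb(0, PARAMS_PER_EXPERT, NUM_EXPERTS, 0.5)     # ~4-bit weights
active = active_params_per_token(0, PARAMS_PER_EXPERT, NUM_EXPERTS_PER_TOK)

print(f"fp16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {q4_gb:.0f} GB")
print(f"active params per token: {active / 1e9:.0f}B")
```

Under these assumptions even 4-bit weights are well past 12GB, but only a fraction of the parameters is used for any one token, which is what makes offloading schemes worth considering.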
It's also a good candidate for splitting across small GPUs, maybe.
One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU, and the downstream expert model weights on the CPU/IGP. This is actually pretty efficient, as the CPU/IGP is really bad at prompt ingestion but reasonably fast at token generation with only ~14B parameters active per token.
Llama.cpp all but does this already, and I'm sure MLC will implement it as well.
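To make the idea concrete, here is a toy sketch of top-2 routing over 8 experts, matching the config quoted upthread. It is not how llama.cpp or MLC implement anything; it uses one common formulation of top-k gating (softmax over the selected logits), and the "experts" are stand-in functions that just scale their input. The point is that only 2 of the 8 experts run per token, so the expert weights could plausibly sit in slower memory.

```python
import math

NUM_EXPERTS = 8
NUM_EXPERTS_PER_TOK = 2


def top_k_route(router_logits, k=NUM_EXPERTS_PER_TOK):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)           # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, router_logits, experts):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits):
        # In the layout proposed above, this call is the part that would
        # run on the CPU/IGP; the router would stay on the GPU.
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts (hypothetical): expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

logits = [0.0, 0.0, 0.0, 5.0, 0.0, 5.0, 0.0, 0.0]
routes = top_k_route(logits)
result = moe_layer([1.0, 2.0], logits, experts)
print(routes)
print(result)
```

The other six experts are never called for this token, so their weights are never read.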