by brucethemoose2 929 days ago
Looks like it will squeeze into 24GB once the llama runtimes work it out.

It's also a good candidate for splitting across small GPUs, maybe.

One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU, and the downstream expert model weights on the CPU/IGP. This is actually pretty efficient, as the CPU/IGP is really bad at prompt ingestion but reasonably fast at token generation at the ~14B-parameter scale.
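
To make the reasoning concrete, here's a rough back-of-envelope sketch. It assumes prompt ingestion is compute-bound (roughly 2 FLOPs per active parameter per token, a common rule of thumb) and token generation is memory-bandwidth-bound (every active weight is read once per token). The function names, the 4-bit quantization, and the device figures are all made-up placeholders for illustration, not measurements of any real hardware or runtime.

```python
def generation_tokens_per_s(active_params_b, bits_per_weight, mem_bandwidth_gb_s):
    """Upper bound on decode speed if each token streams all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token


def prompt_tokens_per_s(active_params_b, compute_tflops):
    """Upper bound on prompt ingestion speed at ~2 FLOPs per active parameter per token."""
    flops_per_token = 2 * active_params_b * 1e9
    return compute_tflops * 1e12 / flops_per_token


if __name__ == "__main__":
    ACTIVE_PARAMS_B = 14   # ~14B active parameters, as in the comment above
    BITS_PER_WEIGHT = 4    # assumed quantization level

    # Hypothetical devices: (memory bandwidth in GB/s, usable compute in TFLOPS)
    devices = {"cpu_igp": (80, 1.0), "gpu": (900, 80.0)}

    for name, (bandwidth, tflops) in devices.items():
        gen = generation_tokens_per_s(ACTIVE_PARAMS_B, BITS_PER_WEIGHT, bandwidth)
        prompt = prompt_tokens_per_s(ACTIVE_PARAMS_B, tflops)
        print(f"{name}: generation <= {gen:.1f} tok/s, prompt <= {prompt:.1f} tok/s")
```

With placeholder numbers like these, the gap between the two devices is much wider for prompt ingestion than for generation, which is the whole argument for keeping ingestion on the GPU. Real throughput will be lower than these ceilings.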

Llama.cpp all but already does this, and I'm sure MLC will implement it as well.