Hacker News new | ask | show | jobs
by MacsHeadroom 929 days ago
Not 7B, 8x7B.

It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).

1 comments

Given the config parametes posted, its 2 experts per token, so the conputation cost per token should be the cost of the conponent that selects experts + 2× cost of a 7B model.
Ah good catch. Upon even closer examination, the attention layer (~2B params) is shared across experts. So in theory you would need 2B for the attention head + 5B for each of two experts in RAM.

That's a total of 12B, meaning this should be able to be run on the same hardware as 13B models with some loading time between generations.

Yes, but I also care about "can I load this onto my home GPU?" where, if I need all experts for this to run, the answer is "no".
The answer is yes if you have a 24GB GPU. Just wait for 4bit quantization.

Or watch Tim Dettmers, who is releasing code to run Mixtral 8x7b in just 4GB of RAM.