


	by MacsHeadroom 929 days ago
	Not 7B, 8x7B. It will run at the speed of a 7B model while being much smarter, but it requires ~24GB of RAM instead of ~4GB (in 4-bit).


dragonwriter 929 days ago

Given the config parameters posted, it's 2 experts per token, so the computation cost per token should be the cost of the component that selects experts + 2× the cost of a 7B model.
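To make that estimate concrete, here is a rough sketch of top-2 routing and the per-token cost it implies. The function names, the example router scores, and the choice to renormalize the softmax over only the selected experts are illustrative assumptions, not details confirmed by the posted config.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and return [(expert_index, weight)].

    Assumption: weights are a softmax over the selected experts only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def cost_per_token(expert_cost, router_cost, k=2):
    """Router cost plus k expert forward passes, in arbitrary cost units."""
    return router_cost + k * expert_cost

# Hypothetical router scores for 8 experts; only 2 of them run for this token.
route = top_k_route([0.1, 2.0, -1.0, 2.0, 0.5, 0.0, 0.3, -0.2], k=2)
```

With 8 experts and k=2, the other six experts contribute nothing to the compute for that token, which is why the cost scales with 2 experts rather than 8.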


MacsHeadroom 929 days ago

Ah, good catch. Upon even closer examination, the attention layers (~2B params) are shared across experts. So in theory you would need 2B for the shared attention layers + 5B for each of two experts in RAM.

That's a total of 12B, meaning this should run on the same hardware as 13B models, with some loading time between generations.
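The arithmetic above, as a back-of-the-envelope sketch. The ~2B shared and ~5B per-expert figures are the estimates from this comment, not official numbers, and the calculation counts weights only (no KV cache, activations, or quantization overhead). The helper names are made up for illustration.

```python
def params_billion(shared_b, expert_b, n_experts):
    """Total parameters (in billions) for shared layers plus n_experts experts."""
    return shared_b + n_experts * expert_b

def weight_gb(params_b, bits_per_param):
    """Weight storage in GB (1e9 bytes) for params_b billion parameters."""
    return params_b * bits_per_param / 8

SHARED_B = 2   # shared attention params, in billions (estimate from the thread)
EXPERT_B = 5   # params per expert, in billions (estimate from the thread)

active = params_billion(SHARED_B, EXPERT_B, 2)    # 2 experts resident at once
full = params_billion(SHARED_B, EXPERT_B, 8)      # all 8 experts resident

active_4bit_gb = weight_gb(active, 4)
full_4bit_gb = weight_gb(full, 4)
```

Under these assumptions, two resident experts come to 12B parameters (about 6GB of weights at 4-bit), while keeping all eight experts loaded comes to 42B parameters (about 21GB at 4-bit).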


stavros 929 days ago

Yes, but I also care about "can I load this onto my home GPU?" where, if I need all experts for this to run, the answer is "no".


MacsHeadroom 928 days ago

The answer is yes if you have a 24GB GPU. Just wait for 4-bit quantization.

Or watch Tim Dettmers, who is releasing code to run Mixtral 8x7B in just 4GB of RAM.
