| HN Mirror



	by tarruda 930 days ago
	Still 7B, but now with 32k context. Looking forward to seeing how it compares with the previous one, and what the community does with it.

3 comments

MacsHeadroom 929 days ago

Not 7B, 8x7B.

It will run at the speed of a 7B model while being much smarter, but it will require ~24GB of RAM instead of ~4GB (at 4-bit).


dragonwriter 929 days ago

Given the config parameters posted, it's 2 experts per token, so the computation cost per token should be the cost of the component that selects experts + 2× the cost of a 7B model.
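
To make the "2 experts per token" cost argument concrete, here is a minimal sketch of top-2 routing in a mixture-of-experts layer. It is a generic illustration, not Mixtral's actual code: the function name `top2_route`, the toy experts, and the choice to softmax over only the two selected router scores are all assumptions made for the example.

```python
import math

def top2_route(router_logits, expert_fns, x):
    """Send x to the two highest-scoring experts and mix their outputs.

    Hypothetical helper for illustration; real MoE layers operate on
    tensors, but the control flow is the same idea.
    """
    # Indices of the two largest router scores (the "selector" cost).
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over just the selected scores so the two weights sum to 1.
    m = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - m) for i in top2]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Only the two selected experts are evaluated (the "2x expert" cost).
    out = sum(w * expert_fns[i](x) for i, w in zip(top2, weights))
    return out, top2, weights

# Eight toy experts: expert k just multiplies its input by k + 1.
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]

out, top2, weights = top2_route(logits, experts, 1.0)
print(top2, weights, out)
```

The other six experts are never called for this token, which is why per-token compute scales with the number of selected experts rather than the total number of experts.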


MacsHeadroom 929 days ago

Ah good catch. Upon even closer examination, the attention layer (~2B params) is shared across experts. So in theory you would need 2B for the attention head + 5B for each of two experts in RAM.

That's a total of 12B, meaning this should run on the same hardware as 13B models, with some loading time between generations.
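
The arithmetic above can be written out as a short sketch. The ~2B shared attention and ~5B per expert figures are the rough estimates from this comment, not official numbers, and the function and constant names are made up for the example.

```python
# Rough parameter counts in billions, taken from the estimates in this thread.
SHARED_ATTENTION_B = 2.0   # assumed: attention weights shared by all experts
PER_EXPERT_B = 5.0         # assumed: weights unique to each expert
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2      # from the posted config, per the parent comment

def params_b(shared_b, per_expert_b, n_experts):
    """Shared parameters plus n_experts copies of the expert parameters."""
    return shared_b + per_expert_b * n_experts

# Parameters touched for a single token (two experts active).
active = params_b(SHARED_ATTENTION_B, PER_EXPERT_B, EXPERTS_PER_TOKEN)
# Parameters needed if every expert stays resident in memory.
total = params_b(SHARED_ATTENTION_B, PER_EXPERT_B, NUM_EXPERTS)

print(f"active per token: {active:.0f}B, all experts resident: {total:.0f}B")
```

Under these assumptions, the per-token figure is 12B, while keeping all eight experts loaded comes to 42B. The gap between the two is where the "loading time between generations" would come from.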


stavros 929 days ago

Yes, but I also care about "can I load this onto my home GPU?" where, if I need all experts for this to run, the answer is "no".


MacsHeadroom 928 days ago

The answer is yes if you have a 24GB GPU. Just wait for 4bit quantization.
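
As a back-of-the-envelope check on the 24GB figure, here is a sketch that counts weight storage only (no KV cache, activations, or quantization overhead). The 42B total reuses the thread's rough split of ~2B shared attention plus 8 × ~5B experts, which is an estimate rather than a published number, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8

ALL_EXPERTS_B = 2 + 8 * 5  # assumed total from the thread's estimates: 42B

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(ALL_EXPERTS_B, bits):.1f} GB for all experts")

# For comparison, a dense 7B model at 4-bit.
print(f" 4-bit 7B: ~{weight_memory_gb(7, 4):.1f} GB")
```

On these assumptions, the 4-bit weights alone come to about 21 GB, which leaves only a few GB of headroom on a 24GB card for everything else.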

Or watch Tim Dettmers, who is releasing code to run Mixtral 8x7b in just 4GB of RAM.


brucethemoose2 929 days ago

We can't infer the actual context size from the config.

Mistral 7B is basically an 8K model, but was marked as a 32K one.


seydor 929 days ago

Unfortunately, it's too big for the broader community to test. It will be very interesting to see how well it performs compared to the large models.


brucethemoose2 929 days ago

Not really, it looks like a ~40B class model, which is very runnable.


MacsHeadroom 929 days ago

It's actually ~13B class at runtime. 2B for attention is shared across each expert and then it runs 2 experts at a time.

So 2B for attention + 5Bx2 for inference = 12B in RAM at runtime.


brucethemoose2 929 days ago

Yeah. I just mean in terms of VRAM usage.


MacsHeadroom 928 days ago

Yes, that's what I mean as well.

It's between 7B and 13B in terms of VRAM usage and 70B in terms of performance.

Tim Dettmers (QLoRA creator) released code to run Mixtral 8x7b in 4GB of VRAM, even though it benchmarks better than Llama-2 70B.
