HN Mirror


	by d-z-m 846 days ago
	32 GB isn't quite enough to run a decent quant of Mixtral (on a MacBook). You could try a Q3_K_M, but I'm not sure how lobotomized it would be.

1 comments

CuriouslyC 846 days ago

That's not true. The GGUF quants aren't great, but there are exl2 4-bit quantizations floating around that are pretty sweet.

brucethemoose2 846 days ago

exl2 is Nvidia/AMD only.

But GGUF Mixtral should fit in 32GB... just not with the full 32K context. Long context is very memory-intensive in llama.cpp, at least until they fully implement flash attention and a quantized cache.
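
For a rough sense of why the full 32K context is the problem, here is a back-of-the-envelope sketch in Python. It is not llama.cpp's actual accounting. It assumes the published Mixtral 8x7B shape (32 layers, 8 KV heads, head dimension 128, roughly 46.7B parameters), an unquantized fp16 cache, and an illustrative ~4.5 bits per weight for a 4-bit-class quant. The function names are made up for this example, and compute buffers and OS overhead are ignored.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    Two tensors (K and V) per layer, each holding
    n_ctx * n_kv_heads * head_dim elements.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights."""
    return n_params * bits_per_weight / 8

# Assumed Mixtral 8x7B shape (not taken from the thread).
MIXTRAL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

cache_32k = kv_cache_bytes(**MIXTRAL, n_ctx=32768)  # full context, fp16
cache_4k = kv_cache_bytes(**MIXTRAL, n_ctx=4096)    # shorter context, fp16

# Assumed parameter count and bits per weight, for illustration only.
weights = weight_bytes(46.7e9, 4.5)

print(f"fp16 KV cache @ 32K ctx: {cache_32k / GIB:.1f} GiB")
print(f"fp16 KV cache @ 4K ctx:  {cache_4k / GIB:.1f} GiB")
print(f"weights @ ~4.5 bpw:      {weights / GIB:.1f} GiB")
print(f"weights + 32K cache:     {(weights + cache_32k) / GIB:.1f} GiB")
```

Under these assumptions the cache alone grows from about 0.5 GiB at 4K to 4 GiB at 32K, on top of the weights. A quantized cache would shrink `bytes_per_elem`, which is the lever mentioned above.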

d-z-m 846 days ago

Fair enough. Yeah, I'm talking about GGUF quants only.
