by coder543 826 days ago

It has 480B parameters total, apparently. You would only need 512GB of RAM if you were running at 8-bit. It could probably fit into 256GB at 4-bit, and 4-bit quantization is broadly accepted as a good trade-off these days. Still... that's a lot of memory.

EDIT: This[0] confirms 240GB at 4-bit.

[0]: https://github.com/ggerganov/llama.cpp/issues/6877#issue-226...
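The arithmetic behind those figures can be sketched as follows. This is a back-of-the-envelope estimate that assumes memory is dominated by the weights and uses decimal gigabytes. It ignores the KV cache, activations, and the per-block scale overhead that real 4-bit formats add. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    bytes_total = n_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


N_PARAMS = 480e9  # total parameter count quoted in the comment above

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

By this estimate the weights alone come to about 480 GB at 8-bit and 240 GB at 4-bit, in line with the 512GB and 256GB machine sizes mentioned above.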

kaibee 826 days ago

I know quantizing larger models seems to be more forgiving, but I’m wondering if that applies less to these extreme-MoE models. It seems to me that it should be more like quantizing a 3B model.

coder543 826 days ago

4-bit is fine for models of all sizes, in my experience.

The only reason I personally don’t quantize tiny models very much is because I don’t have to, not because the accuracy gains from running at 8-bit or fp16 are that great. I tried out 4-bit Phi-3 yesterday, and it was just fine.

refulgentis 826 days ago

Yeah, and usually GPU RAM, unless you enjoy waiting a minute for the context to fill :(
