| HN Mirror



	by everdrive 5 hours ago
	What counts as a lot of memory? What could someone do with 16 GB of RAM?

6 comments

throwawayffffas 4 hours ago

Not much. The capable models won't fit unless you go with very low-bit quantization, but that leads to a lot of quality loss.

You generally want to run q8 or some kind of "6bit" quantization at least.

40GB of VRAM is the entry point in my experience: you can run qwen 3.6 35b a3b with full context, or qwen 27b with about 92k of context.

Before you get fully discouraged: you don't need one GPU with 40GB. You can use multiple cards, with minimal impact on performance.
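
To put rough numbers on the quantization trade-off above, here is a minimal sketch of the usual back-of-the-envelope estimate: weight size is parameter count times bits per weight. The function name `weight_gb` is made up for illustration, and the estimate ignores per-block scale overhead in real quantization formats as well as the context (KV cache) memory, so treat it as a lower bound rather than a measured figure.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes.

    Assumption: every parameter is stored at the same bit width and there is
    no extra metadata (real quantization formats add some overhead).
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


if __name__ == "__main__":
    # Compare a 35B-parameter model at several bit widths.
    for bits in (16, 8, 6, 4):
        print(f"35B at {bits}-bit: ~{weight_gb(35, bits):.1f} GB")
```

Under these assumptions a 35B model needs about 35 GB at 8 bits and about 17.5 GB at 4 bits for weights alone, which is roughly why 16 GB forces very low-bit quantization for this class of model.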


zozbot234 5 hours ago

Modern inference engines can stream weights in from SSD to save on RAM, but this makes inference very slow, especially in the trivial single-session case. (The jury is still out on whether batching multiple sessions together can mitigate this well enough. Even then, it mostly helps the "run lots of inferences overnight and get fresh results first thing in the morning" case, which is interesting, since the big third-party suppliers don't really offer a way of doing this at reasonable cost, but a bit of a niche.)
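
A rough way to see why streaming is slow, and why batching might help: if every generated token requires one full pass over weights read from SSD, throughput is capped by read bandwidth divided by weight size, and a batch of sessions shares each pass. The sketch below is a simplified upper bound under exactly those assumptions. The function name and the example numbers (20 GB of weights, 5 GB/s reads) are hypothetical, and it ignores compute time, caching of weights in RAM, and MoE models that touch only part of the weights per token.

```python
def streaming_tokens_per_sec(
    weight_bytes: float,
    read_bytes_per_sec: float,
    batch_size: int = 1,
) -> float:
    """Upper bound on aggregate tokens/sec when weights are streamed from disk.

    Assumption: each decoding step reads all weights once, and that single
    pass produces one token for every session in the batch.
    """
    passes_per_sec = read_bytes_per_sec / weight_bytes
    return passes_per_sec * batch_size


if __name__ == "__main__":
    weights = 20e9    # hypothetical: 20 GB of quantized weights
    ssd_read = 5e9    # hypothetical: 5 GB/s sequential read
    for batch in (1, 8, 64):
        rate = streaming_tokens_per_sec(weights, ssd_read, batch)
        print(f"batch={batch}: at most {rate:.2f} tokens/sec in total")
```

With these made-up numbers a single session tops out at a quarter of a token per second, while each session in a batch still sees that same per-session rate; only the total improves, which fits the overnight-batch use case.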


abalashov 5 hours ago

Not a ton. I'd say 64 GB minimum to play, 96-128 GB better.


throwawayffffas 4 hours ago

Nah, you can run the 24b-35b class with between 90k and 256k of context in about 40GB, and they are pretty good. The MoE variants especially fit neatly in 40GB.
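
Context length matters here because the KV cache grows linearly with it. A minimal sketch of the standard estimate for a plain transformer with grouped-query attention follows. The layer count, KV head count, and head dimension are hypothetical placeholders, not the configuration of any model named in this thread, and models with hybrid or linear attention layers, or a quantized KV cache, will need less.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,  # fp16/bf16 cache entries
) -> int:
    """Estimated KV cache size in bytes for one sequence.

    Each token stores one key and one value vector (hence the factor 2)
    of size kv_heads * head_dim in every layer.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len


if __name__ == "__main__":
    # Hypothetical architecture: 48 layers, 8 KV heads, head dimension 128.
    for ctx in (8_000, 92_000, 256_000):
        size_gb = kv_cache_bytes(48, 8, 128, ctx) / 1e9
        print(f"context={ctx}: ~{size_gb:.1f} GB of KV cache")
```

Adding this figure to the weight size gives a first approximation of total memory, which is the sum the 40GB claim above is implicitly budgeting.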


abalashov 58 minutes ago

Yeah, but then you need RAM for the rest of your OS and applications. I'd say 64 to be comfortable in the sense to which most HN users are accustomed.


ValdikSS 5 hours ago

Gemma e2b, Gemma e4b. They're basically made for smartphones. You can run e2b with 8GB RAM.


trouve_search 5 hours ago

gemma 12B 4bit quant; try something with MTP and an AWQ quant


monegator 5 hours ago

gemma runs pretty well
