

	by theanonymousone 279 days ago
	But the RAM+VRAM can never be less than the size of the total (not active) model, right?


NitpickLawyer 279 days ago

Correct. You want everything loaded, but only some experts are activated on each forward pass, so the computation is less than in a dense model of the same total size.
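To put rough numbers on that, here is a back-of-the-envelope sketch. The function name `moe_footprint` and the example figures (100B total parameters, 10B active, 2 bytes per parameter) are made up for illustration and do not describe any particular model.

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param):
    """Return (resident_gb, touched_gb_per_token) for a mixture-of-experts model.

    Parameter counts are in billions, so billions * bytes-per-parameter gives
    decimal gigabytes directly.
    """
    # Every expert has to be resident somewhere (RAM + VRAM), because the
    # router may pick any of them for the next token.
    resident_gb = total_params_b * bytes_per_param
    # Only the active parameters are read and multiplied for a given token.
    touched_gb = active_params_b * bytes_per_param
    return resident_gb, touched_gb


# Hypothetical model: 100B total, 10B active, 16-bit weights.
resident, touched = moe_footprint(100, 10, 2)
print(f"resident: {resident} GB, touched per token: {touched} GB")
```

So the memory requirement follows the total size, while the per-token work follows the active size.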

That being said, there are libraries that can load a model layer by layer (say, from an SSD) and technically perform inference with ~8 GB of RAM, but it'd be really, really slow.
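The general idea looks something like the toy sketch below. This is not the API of any real library: the helper names (`save_layers`, `streamed_forward`, `demo`) are hypothetical, the "layers" are tiny matrices stored as JSON, and a real implementation would deal with tensors, attention and a KV cache.

```python
import json
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer's weight matrix to its own file; return the paths in order."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths


def matvec(weights, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def streamed_forward(paths, x):
    """Run the layers in order, holding only one layer's weights at a time."""
    for path in paths:
        weights = json.loads(path.read_text())  # disk read for every layer: the slow part
        x = matvec(weights, x)
        del weights  # drop this layer before the next one is loaded
    return x


def demo():
    """Build a two-layer toy model on disk and run one streamed forward pass."""
    layers = [
        [[1.0, 0.0], [0.0, 2.0]],  # scales the second component by 2
        [[0.0, 1.0], [1.0, 0.0]],  # swaps the two components
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_layers(layers, tmp)
        return streamed_forward(paths, [3.0, 4.0])


print(demo())
```

Peak memory is roughly one layer plus the activations, but every layer is read from disk again for each forward pass, which is where the slowness comes from.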


theanonymousone 279 days ago

Can you give me a name please? Is that distributed llama or something else?


skirmish 279 days ago

I have not used it, but this is probably it: https://github.com/lyogavin/airllm
