| HN Mirror



	by Aurornis 69 days ago
	Additional VRAM is needed for context. This model is a MoE model with only 3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than the full model needs. The more you offload to the CPU, the slower it becomes, though.
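To put a rough number on the "additional VRAM for context" part: the KV cache grows linearly with context length. Here is a minimal sketch, assuming a standard attention layout with an fp16 cache. The model dimensions are hypothetical placeholders, not the specs of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V tensor per layer,
    each holding context_len * n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical dimensions, chosen only for illustration.
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, context_len=32768)
print(f"{size / 2**30:.2f} GiB")  # prints "3.00 GiB"
```

Doubling the context length doubles this figure, which is why a model that fits at a short context can run out of VRAM at a long one.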


Isn't that some kind of gambling if you offload random experts onto the CPU?

Or does it only work by layers? But that would affect all experts.

Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.
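A layer-based split can be sketched roughly like this. It is a simplified illustration, assuming every layer is the same size and that whole layers are placed on one device or the other (similar in spirit to what a layer-count option such as llama.cpp's `--n-gpu-layers` controls). The function name and the numbers are hypothetical.

```python
def split_layers(n_layers, bytes_per_layer, vram_budget, reserved=0):
    """Return (gpu_layers, cpu_layers) for a whole-layer split.

    reserved is VRAM set aside for things like the KV cache.
    """
    usable = max(vram_budget - reserved, 0)
    gpu_layers = min(n_layers, usable // bytes_per_layer)
    return gpu_layers, n_layers - gpu_layers


# Hypothetical: 48 layers of 500 MiB each, 16 GiB of VRAM, 3 GiB reserved.
gpu, cpu = split_layers(48, 500 * 2**20, 16 * 2**30, reserved=3 * 2**30)
print(gpu, cpu)  # prints "26 22"
```

With this kind of split, nothing about which experts end up on the CPU is random: every expert belonging to an offloaded layer runs on the CPU, and every expert in a GPU layer stays on the GPU.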

Speculative decoding is already gambling.