


	by qeternity 807 days ago
	4-bit should take up less than this, since quite a few parameters are shared between experts. But unless you're running bs=1, it will be painful vs 8x GPU, as you're almost certain to be activating most/all of the experts in a batch.
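
To put a rough number on the "most/all of the experts" point, here is a back-of-the-envelope sketch. It assumes each token is routed to `top_k` distinct experts uniformly at random and independently of other tokens, which real routers do not do exactly, and the 8-expert, top-2 configuration is an illustrative assumption rather than something stated above. Under those assumptions, the expected fraction of experts touched by a batch of B tokens is 1 - (1 - k/E)^B.

```python
def expected_active_fraction(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected fraction of experts hit at least once by a batch.

    Assumption (not from the comment): every token picks `top_k` distinct
    experts uniformly at random, independently of the other tokens.
    """
    # Probability that one token does NOT select a given expert.
    p_miss_one_token = 1.0 - top_k / num_experts
    # The expert stays idle only if every token in the batch misses it.
    return 1.0 - p_miss_one_token ** batch_tokens


if __name__ == "__main__":
    # Hypothetical configuration: 8 experts, 2 active per token.
    for bs in (1, 2, 4, 8, 16, 32):
        frac = expected_active_fraction(num_experts=8, top_k=2, batch_tokens=bs)
        print(f"bs={bs:>2}: ~{frac:.1%} of experts active")
```

With these assumed numbers, a single token touches a quarter of the experts, while a batch of 8 tokens already touches roughly 90% of them, so any per-step saving from sparse activation mostly disappears once you batch.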