| HN Mirror


	by MacsHeadroom 929 days ago
	Ah, good catch. On closer examination, the attention layers (~2B params) are shared across experts. So in theory you would need 2B for the shared attention layers + 5B for each of two experts in RAM. That's a total of 12B params, meaning this should be runnable on the same hardware as 13B models, with some loading time between generations.
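
	To spell out the arithmetic, here is a rough sketch in Python. The ~2B and ~5B figures are the estimates from the comment above; the function names and the bytes-per-parameter values are my own assumptions for illustration, and the memory figure covers weights only (no KV cache or activations).

```python
def resident_params_b(shared_b: float, expert_b: float, active_experts: int) -> float:
    """Billions of parameters that must sit in RAM at once:
    the shared attention weights plus each active expert."""
    return shared_b + expert_b * active_experts


def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_b * bytes_per_param


# Estimates from the comment: ~2B shared attention, ~5B per expert, two experts loaded.
total_b = resident_params_b(shared_b=2.0, expert_b=5.0, active_experts=2)
print(total_b)  # 12.0

# Assumed precisions: 2 bytes/param for fp16, 0.5 bytes/param for 4-bit quantization.
print(weight_memory_gb(total_b, 2.0))  # 24.0
print(weight_memory_gb(total_b, 0.5))  # 6.0
```

	The 12B result is only the amount resident at one time; the experts not currently loaded would still have to live on disk or elsewhere, which is where the loading time comes from.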