


	by fspeech 264 days ago
	It uses 75% linear attention layers, so it is inherently lower cost. It is also MoE, so the active parameter count is far lower than the total parameter count.
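
	A rough back-of-envelope sketch of both points, in Python. Everything here is an assumption for illustration: the function names, the layer counts, and the sizes are hypothetical and are not taken from the model under discussion. It uses the usual scaling rules with constants dropped: softmax attention costs about n^2 * d per layer, linear attention about n * d^2, and an MoE layer only runs the top-k experts chosen for each token.

```python
def attention_cost(seq_len, d_model, n_layers, linear_fraction):
    """Unitless proxy for attention cost over one sequence (constants dropped)."""
    n_linear = round(n_layers * linear_fraction)
    n_full = n_layers - n_linear
    full = n_full * seq_len ** 2 * d_model      # softmax attention: ~ n^2 * d
    linear = n_linear * seq_len * d_model ** 2  # linear attention:  ~ n * d^2
    return full + linear


def moe_params(shared, per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a simple MoE stack."""
    total = shared + per_expert * n_experts
    active = shared + per_expert * top_k  # only top_k experts run per token
    return total, active


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend at long context.
    hybrid = attention_cost(100_000, 1024, 48, linear_fraction=0.75)
    dense = attention_cost(100_000, 1024, 48, linear_fraction=0.0)
    print(f"hybrid / full attention cost: {hybrid / dense:.2f}")

    total, active = moe_params(shared=2e9, per_expert=1e9, n_experts=64, top_k=4)
    print(f"active / total params: {active / total:.2f}")
```

	The saving from the linear layers grows with context length, since the remaining 25% of full attention layers still scale quadratically. The MoE saving applies to compute per token, not to the memory needed to hold all the experts.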