| HN Mirror


	by goldenarm 5 days ago
	Consumer and server hardware are quite different, especially Google's TPUs. They notably have much larger mixture-of-experts ratios and more complex caching systems. At that scale, and with those inference budgets, they are incentivised to optimize as much as possible. Also, Google DeepMind has a six-month embargo on strategic papers, so I bet the juiciest quantization tech isn't public yet.