| HN Mirror



	by jasonni 766 days ago
	From their announcement ("Isn’t inference bottlenecked on memory bandwidth, not compute?"), it seems the weights are still held in memory. The chip may have only a limited on-chip cache for computation. Input tokens go through a batched pipeline to relieve the memory bottleneck, similar to Groq.
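
	A rough way to see why batching relieves the memory bottleneck is a roofline-style estimate. The sketch below is only an illustration under stated assumptions: the function names, the 70B-parameter model, and the bandwidth and FLOP figures are all hypothetical and do not come from the announcement. It also ignores KV-cache and activation traffic.

```python
# Roofline-style sketch: weights are streamed from memory once per decode
# step and shared by every sequence in the batch, so the memory cost is
# amortized across the batch. All numbers are hypothetical.

def decode_step_time(params, batch, bytes_per_param, mem_bw, flops):
    """Estimated seconds for one decode step (one new token per sequence)."""
    # Time to read all weights once; independent of batch size.
    mem_time = params * bytes_per_param / mem_bw
    # Roughly 2 FLOPs (multiply + add) per parameter per token.
    compute_time = 2 * params * batch / flops
    # The slower of the two sets the step time.
    return max(mem_time, compute_time)


def tokens_per_second(params, batch, bytes_per_param=2, mem_bw=1e12, flops=1e15):
    """Aggregate throughput across the whole batch."""
    step = decode_step_time(params, batch, bytes_per_param, mem_bw, flops)
    return batch / step


def crossover_batch(bytes_per_param=2, mem_bw=1e12, flops=1e15):
    """Batch size at which memory time equals compute time."""
    return bytes_per_param * flops / (2 * mem_bw)


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter model
    for batch in (1, 10, 100, 1000, 2000):
        print(batch, round(tokens_per_second(params, batch), 1))
    print("crossover batch:", crossover_batch())
```

	Under these assumed figures, throughput grows roughly linearly with batch size while the step is memory-bound, then flattens once the batch passes the crossover point and compute becomes the limit.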