| HN Mirror


	by kolinko 866 days ago
	If I'm not mistaken, parallel (batched) inference requests and prompt preprocessing are compute-bound rather than memory-bound. Also, if you have just a single model you want to optimise (for inference only, not training), you could build an array of ASICs that each do one specific matrix computation; then you don't need to read weights from memory at all.
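
	A rough way to sanity-check the compute-bound claim is a roofline-style estimate for a single weight matrix. This is only a sketch: the function names, the fp16 assumption (2 bytes per value), the 4096x4096 layer, and the accelerator figures (100 TFLOP/s, 1 TB/s) are all made up for illustration and do not describe any particular chip.

```python
def arithmetic_intensity(batch_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for one (batch_tokens x d_in) @ (d_in x d_out) matmul.

    Assumes weights and activations are each read or written exactly once.
    """
    flops = 2 * batch_tokens * d_in * d_out           # one multiply + one add per weight per token
    values_moved = (d_in * d_out                      # weights
                    + batch_tokens * d_in             # input activations
                    + batch_tokens * d_out)           # output activations
    return flops / (bytes_per_value * values_moved)


def is_compute_bound(batch_tokens, d_in, d_out, peak_flops, mem_bandwidth):
    """True when the matmul's intensity reaches the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth                # FLOPs the chip can do per byte it can fetch
    return arithmetic_intensity(batch_tokens, d_in, d_out) >= ridge


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth (ridge = 100 FLOPs/byte).
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

for tokens in (1, 512):
    ai = arithmetic_intensity(tokens, 4096, 4096)
    bound = "compute" if is_compute_bound(tokens, 4096, 4096, PEAK_FLOPS, MEM_BANDWIDTH) else "memory"
    print(f"{tokens:4d} tokens: {ai:7.1f} FLOPs/byte -> {bound}-bound")
```

	With one token at a time (single-stream decoding) every weight is fetched for about one FLOP per byte, so the memory bus is the limit. With hundreds of tokens sharing the same weight read (batched requests or prompt preprocessing), the intensity in this toy setup rises past the ridge point and the multipliers become the limit.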