


	by buildbot 457 days ago
	AI inference is typically memory-bandwidth limited, unlike training, which can reuse each weight it reads across all <sequence length> * <batch size> tokens in a step. Inference, specifically decoding, requires you to read all of the weights for each new token, so the FLOPs per byte are much lower during inference!
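
	To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes about 2 FLOPs (one multiply, one add) per weight per token, 2-byte (fp16/bf16) weights, and counts only weight traffic, ignoring activations, the KV cache, and the backward pass. The function name and the example sequence length and batch size are made up for illustration.

```python
def flops_per_weight_byte(tokens_per_weight_read: int, bytes_per_param: float = 2.0) -> float:
    """Rough arithmetic intensity, counting weight traffic only.

    Assumption: each weight contributes one multiply and one add (2 FLOPs)
    for every token processed while that weight is resident on chip.
    """
    flops_per_param = 2 * tokens_per_weight_read
    return flops_per_param / bytes_per_param


# Training-style forward pass: weights are reused across every token in the batch.
# The sequence length and batch size here are hypothetical.
training_like = flops_per_weight_byte(tokens_per_weight_read=2048 * 8)

# Single-stream decoding: all weights are read again for each new token.
decode = flops_per_weight_byte(tokens_per_weight_read=1)

print(f"training-like: {training_like:.0f} FLOPs per weight byte")
print(f"decode:        {decode:.0f} FLOPs per weight byte")
print(f"ratio:         {training_like / decode:.0f}x")
```

	Under these assumptions the ratio is just the number of tokens sharing each weight read, which is also why batching more requests together during decoding raises the FLOPs per byte.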