| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by drdeca 641 days ago
	Ah, yeah, I guess that probably is true of transformers in practice. I was thinking about something which strictly takes in a sequence of tokens and outputs a (possibly 1-hot) probability distribution over all possible next tokens. Such a thing running autoregressively would have to recompute y each time. But, if intermediate computations are cached, as with transformers in practice, then this isn’t necessary.