HN Mirror

	by ludwik 1101 days ago
	No, what they do is predict a single token that follows the preceding token sequence (which was indeed analyzed using self-attention). Longer output sequences are created by repeating this simple task multiple times, where the previously output tokens become part of the preceding token sequence.
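
To make the loop concrete, here is a minimal sketch of that repeat-one-step process. The names `toy_next_token`, `TOY_TABLE`, and `generate` are hypothetical and made up for illustration. The toy "model" only looks at the last token, whereas a real model would attend over the whole preceding sequence.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model. A real model would run
# self-attention over the entire context; this toy only reads the last token.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}


def toy_next_token(context: List[str]) -> str:
    """Predict exactly one token given the preceding token sequence."""
    return TOY_TABLE.get(context[-1], "<eos>")


def generate(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    max_new_tokens: int = 10,
    eos: str = "<eos>",
) -> List[str]:
    """Build a longer output by repeating the single-token prediction step."""
    tokens = list(prompt)  # copy so the caller's prompt is not mutated
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one prediction from everything so far
        if tok == eos:
            break
        tokens.append(tok)  # the output token joins the preceding sequence
    return tokens


print(generate(["the"], toy_next_token))  # ['the', 'cat', 'sat', 'down']
```

The point is that `generate` never asks for more than one token at a time. The only state carried between steps is the growing `tokens` list.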