| HN Mirror



	by tux3 1254 days ago
	Check out The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/ tl;dr: the model decodes the output one word (token) at a time, but at each step it can focus on any mix of words from the input via the attention mechanism. So in GPT, output token n can't depend on a future output token n+1, but it can attend to any of the input tokens, as well as to the output tokens generated so far.
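
	A rough sketch of that masking rule in plain Python, in case it helps. This is not how GPT is actually implemented; it assumes a single sequence where the input (prompt) tokens come first and the output tokens follow, and the names `causal_mask`, `masked_softmax`, `n_input`, and `n_output` are made up for illustration, as are the scores.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    # Subtract the largest allowed score for numerical stability.
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: 3 input (prompt) tokens followed by 2 output tokens.
n_input, n_output = 3, 2
mask = causal_mask(n_input + n_output)

# Row for the first output position (index 3): it may look at every input
# token and at itself, but not at the later output position (index 4).
first_output_row = mask[n_input]

# Made-up attention scores for that row, one per position in the sequence.
weights = masked_softmax([0.5, 1.0, -0.2, 0.3, 2.0], first_output_row)
```

	The last entry of `weights` comes out as zero even though its raw score is the largest, which is the "can't depend on a future output token" part. The first three entries are all nonzero, which is the "can attend to any of the input tokens" part.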