by tux3 1254 days ago
Check out The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/

tl;dr: It decodes the output one token at a time, but at each step it can focus on any mix of tokens from the input via the attention mechanism. So in GPT, output token n can't depend on future output token n+1, but it can attend to any of the input tokens (and to the output tokens that came before it).
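
If it helps, here's a rough sketch of that masking in plain Python. This is not GPT's actual code: the function name `causal_attention_weights` and the all-zero toy scores are made up for illustration, and a real model computes the scores from learned query/key projections. It only shows that position n gets a softmax over positions up to and including n, and zero weight on everything after it.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score for position i looking at position j.
    Returns attention weights where position i only sees positions 0..i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Only positions up to and including i are visible (no future tokens).
        visible = scores[i][: i + 1]
        # Numerically stable softmax over the visible positions.
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Toy sequence (assumed): 3 input tokens followed by 2 output tokens,
# with all raw scores equal so the effect of the mask is easy to see.
scores = [[0.0] * 5 for _ in range(5)]
weights = causal_attention_weights(scores)

# Position 3 is the first output token: it attends to all 3 input tokens
# and to itself, but not to the later output token at position 4.
print(weights[3])  # [0.25, 0.25, 0.25, 0.25, 0.0]
```

In practice the same effect is usually achieved by setting the masked scores to negative infinity before one softmax over the whole row, which is easier to do as a single matrix operation.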