by tux3
1254 days ago
Check out The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
tl;dr: The model decodes the output one word at a time, but at each step it can focus on any mix of words from the input via the attention mechanism.
So in GPT, output token n can't depend on the future output token n+1, but it can attend to any of the input tokens.
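
To make the masking concrete, here is a rough sketch in plain Python (my own toy code, not taken from the linked article or from any real GPT implementation). The function name `attention_weights` and the all-zero scores are made up for illustration. With `causal=True`, position i gets zero weight on every position after it. With `causal=False`, every position is visible, which is what attention over the input looks like in an encoder-decoder model. In a decoder-only model like GPT, as far as I understand it, the prompt simply sits earlier in the same sequence, so the causal mask already lets every output token see all of it.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over raw scores.

    With causal=True, row i ignores every column j > i (the "future").
    Hypothetical helper for illustration only.
    """
    weights = []
    for i, row in enumerate(scores):
        # Masked positions get -inf, so they end up with zero weight.
        masked = [
            s if (not causal or j <= i) else float("-inf")
            for j, s in enumerate(row)
        ]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy scores: 3 positions scoring 3 positions each, all equal.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]

# Attention over already-generated output tokens: no peeking ahead.
self_attn = attention_weights(scores, causal=True)

# Attention over the input tokens: everything is visible.
input_attn = attention_weights(scores, causal=False)

for row in self_attn:
    print(row)
```

Each row of `self_attn` spreads its weight only over positions up to and including its own, while each row of `input_attn` spreads it over all three.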