Hacker News
by ipnon 1103 days ago
Transformers don’t predict next tokens, right? They predict sequences based on their self-attention to some preceding token sequence?
1 comment

No. They predict a single token: the one that follows the preceding token sequence (which is indeed analyzed using self-attention). More precisely, the model outputs a probability distribution over possible next tokens, and one token is picked from it. Longer output sequences are created by repeating this simple step, with each newly output token appended to the preceding token sequence before the next prediction.
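
To make the loop concrete, here is a minimal sketch in Python. It is not a real transformer: `toy_next_token` is a hypothetical stand-in that just looks up the last token in a small table, and `<eos>` is an assumed end-of-sequence marker. The point is only the shape of the generation loop, where each predicted token is fed back in as context.

```python
from typing import Callable, List

EOS = "<eos>"  # assumed end-of-sequence marker, for illustration only


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for a model: returns one next token.

    A real transformer would attend over the whole context; this toy
    only looks at the last token so the example stays self-contained.
    """
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": EOS}
    return table.get(context[-1], EOS)


def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int = 10,
) -> List[str]:
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one prediction for the whole context
        if tok == EOS:
            break
        tokens.append(tok)  # the output becomes part of the next input
    return tokens


print(generate(toy_next_token, ["the"]))  # ['the', 'cat', 'sat', 'down']
```

Swapping `toy_next_token` for a function that runs a real model and samples from its output distribution would leave `generate` unchanged, which is the sense in which sequence generation is just repeated next-token prediction.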