by ludwik 1101 days ago
No, what they do is predict a single token that follows the preceding token sequence (which is indeed analyzed using self-attention). Longer outputs are produced by repeating this simple step: each predicted token is appended to the sequence, and the extended sequence becomes the input for the next prediction.
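To make the loop concrete, here is a rough sketch of it in Python. Everything in it is made up for illustration: `toy_next_token` is a hypothetical stand-in for the model (a fixed lookup on the last token, whereas a real model would look at the whole context via self-attention), and `generate`, `max_new_tokens`, and the `<eos>` stop token are just names I picked.

```python
from typing import Callable, List


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for a language model: one token out per call."""
    # Assumption: a fixed lookup on the last token only. A real model would
    # compute its prediction from the entire context.
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}
    last = context[-1] if context else ""
    return table.get(last, "<eos>")


def generate(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    max_new_tokens: int = 10,
) -> List[str]:
    """Repeat single-token prediction, feeding each output back in as input."""
    tokens = list(prompt)  # copy so the caller's prompt is not modified
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # predict exactly one token
        if tok == "<eos>":        # stop when the model signals the end
            break
        tokens.append(tok)        # the output becomes part of the next input
    return tokens


print(generate(["the"], toy_next_token))
# prints ['the', 'cat', 'sat', 'down']
```

The point is that `generate` never asks for more than one token at a time; the longer output comes entirely from the loop.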