| HN Mirror

No, what they do is predict a single token that follows the preceding token sequence (which was indeed analyzed using self-attention). Longer output sequences are created by repeating this simple task multiple times, where the previously output tokens become part of the preceding token sequence.