Did you check the Vonnegut writing rules example I posted at top of this thread - in particular look at Bing/GPT's explanation of how its cake story matches up to Vonnegut's rules ? It's hard to imagine how it could have come up with such a coherent story, checking all the rules, if it was only conceiving of it's continuing story on a word by word basis. It's not as if sentence #1 matches rule number 1, sentence 2 matches rule number 2, etc. It seems there had to be some wholistic composition for it to do that.
Note too that despite the output being sampled from a distribution based on a "randomness" temperature, there are many case where what it is trying to say so much constrains the output that certain words/synonyms/concepts are all but forced.
It's easy to see that its not just doing one token at a time but is anticipating future tokens. Consider the context of a Q&A. The response might start with any of a number of words, exactly which word depends on what comes after. But if it randomly chooses the wrong word, it will either be forced to complete the wrong answer, or be backed into a corner and engage in circumlocutions to course-correct. This doesn't happen in practice for recent big models.
Note too that despite the output being sampled from a distribution based on a "randomness" temperature, there are many case where what it is trying to say so much constrains the output that certain words/synonyms/concepts are all but forced.