Hacker News
by HarHarVeryFunny 1214 days ago
Sure, but the point being discussed is that despite the word by word output, the output does not appear to be "chosen" on a word by word basis. OP investigated the case where the word "an" anticipates the following word ("an apple" vs "a pear").
2 comments

I see 2 options:

1. We don't know what the coding layer between Bing and GPT looks up and stores as a prompt, i.e. as working memory.

2. It can do the equivalent of receiving its own prompt silently.

I've seen that with code it outputs the steps for the code, then writes the code.

So there's some kind of plan-and-execute going on. Maybe it can do that in-model somehow.

>So there's some kind of plan-and-execute going on. Maybe it can do that in-model somehow.

The simple answer is that the internal state that picks the next token is stable over iterations so that the model can follow a consistent plan over multiple token outputs. Then as the plan "unfolds" in the output tokens, these tokens help stabilize the plan further, thus creating consistency over long generations.

It's chosen by the n-gram, and randomly so; that does suggest it is completing the text a word at a time.
Did you check the Vonnegut writing rules example I posted at the top of this thread? In particular, look at Bing/GPT's explanation of how its cake story matches up to Vonnegut's rules. It's hard to imagine how it could have come up with such a coherent story, checking all the rules, if it was only conceiving of its continuing story on a word by word basis. It's not as if sentence #1 matches rule number 1, sentence #2 matches rule number 2, etc. It seems there had to be some holistic composition for it to do that.

Note too that despite the output being sampled from a distribution based on a "randomness" temperature, there are many cases where what it is trying to say constrains the output so much that certain words/synonyms/concepts are all but forced.
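
To make the temperature point concrete, here is a minimal sketch of temperature-scaled softmax sampling. The token names and logit values are made up for illustration and are not taken from any real model; the only point is that when one logit dominates, the top token stays near-certain across a range of temperatures, while a flat distribution stays genuinely random.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_probability(logits_by_token, temperature):
    """Probability of the single most likely token at this temperature."""
    probs = softmax_with_temperature(list(logits_by_token.values()), temperature)
    return max(probs)

# Hypothetical logits: a context where several words would fit...
open_context = {"cake": 2.0, "pie": 1.8, "bread": 1.5, "tart": 1.2}
# ...and one where the meaning all but forces a single word.
forced_context = {"cake": 12.0, "pie": 1.8, "bread": 1.5, "tart": 1.2}

for t in (0.7, 1.0, 1.5):
    print(
        f"T={t}: open top-p={top_probability(open_context, t):.3f}, "
        f"forced top-p={top_probability(forced_context, t):.3f}"
    )
```

In the "forced" case the sampler is still nominally random, but it has almost nothing to choose between.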

Kurt Vonnegut is a conditional subspace of the embedding vectors.
It's easy to see that it's not just doing one token at a time but is anticipating future tokens. Consider the context of a Q&A. The response might start with any of a number of words, and exactly which word depends on what comes after. But if it randomly chose the wrong word, it would either be forced to complete the wrong answer, or be backed into a corner and have to engage in circumlocutions to course-correct. This doesn't happen in practice for recent big models.
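
One way to picture the "an apple" vs "a pear" case from the top of the thread is as a toy calculation, not as a claim about what GPT does internally. Assume the model already holds some belief about which noun is coming (the probabilities below are invented); then the probability of emitting "a" or "an" is just that belief summed over the nouns each article is compatible with. The first word is still emitted one token at a time, but its distribution is determined by what follows.

```python
# Hypothetical beliefs about which noun the answer will name.
noun_probs = {"apple": 0.6, "pear": 0.3, "orange": 0.1}

def article_for(noun):
    """Crude spelling-based rule, good enough for this toy vocabulary."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def article_distribution(noun_probs):
    """Marginalize the noun belief into a distribution over the article."""
    dist = {"a": 0.0, "an": 0.0}
    for noun, p in noun_probs.items():
        dist[article_for(noun)] += p
    return dist

dist = article_distribution(noun_probs)
print(dist)
```

If the article were picked without regard to the upcoming noun, the two probabilities would not track the noun belief this way, and the model would regularly back itself into "an pear"-style corners.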