| GPT is a transformer model. Transformers use the attention mechanism. The mechanism is entirely concerned with retaining semantic context and semantic "global dependencies" spanning the entire input and output. https://ar5iv.labs.arxiv.org/html/1706.03762 "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences ... In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output." Beyond that, also note that LLMs are probabilistic machines. Output spat out can vary and there are a handful of knobs (such as temperature) to modulate that output. Finally, I'm pretty sure we (or the workers in the field more like it /g) don't have a firm grasp on why certain failure modes occur. Likely this is due to the fact that we (they) also don't really have a good grasp on how the damn thing actually works its 'magic'. What is clear is that a significant subset of our semantic universe is embedded in symbols and their usage by us and this subset is somehow encoded in neural nets. This captured subset in LLMs is what drives their uncanny generative abilities. What is missing is precisely what would make it plausibly intelligent, plausibly a reasoning agent operating in a coherent semantic context. There are some who claim our minds are just like LLMs. Some of us who pay attention to our minds sometimes catch them making nonsensical noises and correct them. (As you age you begin to notice these things..) So it is interesting to this sentient (who makes claims to being) that my mind is just like my body, it is aging, certain parts are degraded, etc., but my 'whateveritis' that is me, my self, is as timeless as ever, and seems to be a spectator of the aging mechanism .. |
This is not quite true: GPT, specifically, is autoregressive. The computation at each token can only look back at earlier tokens, never forward at later ones.
Given that each token has only a fixed computing budget, it is likely that GPT precomputes information which will be relevant to later tokens, to be routed via attention.
In fact, this effect has been demonstrated in practice: e.g. in a prompt like "Question: Where is the Eiffel tower located? Answer: ", people found that the information about "Paris" is routed from the "Eiffel tower" tokens, i.e. this associative memory was looked up earlier than it was needed.
So I was answering from that perspective: it can do better if it knows what to pre-compute.
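To make the "only looking back" point concrete, here is a minimal sketch of single-head causal attention in plain Python. It is a toy, not GPT's actual implementation: the function name `causal_attention` and the input vectors are made up for illustration, and there are no learned projections, multiple heads or layers. It only shows that changing a later token cannot change what earlier positions compute, which is why anything a later token needs has to be prepared at the earlier positions and then fetched via attention.

```python
import math

def causal_attention(q, k, v):
    """Toy single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of float vectors, one per token position.
    Position i may only attend to positions 0..i.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Causal mask: scores are computed only against positions <= i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out

# Hypothetical toy inputs: 3 tokens, 2 dimensions, same vectors used as q, k and v.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [[1.0, 0.0], [0.0, 1.0], [-3.0, 5.0]]  # only the last token differs

out_x = causal_attention(x, x, x)
out_y = causal_attention(y, y, y)

# Earlier positions never see the last token, so their outputs match.
print("positions 0-1 identical:", out_x[:2] == out_y[:2])
# The last position sees everything before it plus itself, so it differs.
print("position 2 identical:", out_x[2] == out_y[2])
```

In a real model the same constraint holds at every layer, so the work done at the "Eiffel tower" positions is fixed before the "Answer:" tokens exist.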