| You are observing "flattened logits" https://arxiv.org/pdf/2303.08774#page=12&org=openai . The entropy of ChatGPT (as well as all other generative models which have been 'tuned' using RLHF, instruction-tuning, DPO, etc) is so low because it is not predicting "most likely tokens" or doing compression. A LLM like ChatGPT has been turned into an RL agent which seeks to maximize reward by taking the optimal action. It is, ultimately, predicting what will manipulate the imaginary human rater into giving it a high reward. So the logits aren't telling you anything like 'what is the probability in a random sample of Internet text of the next token', but are closer to a Bellman value function, expressing the model's belief as to what would be the net reward from picking each possible BPE as an 'action' and then continuing to pick the optimal BPE after that (ie. following its policy until the episode terminates). Because there is usually 1 best action, it tries to put the largest value on that action, and assign very small values to the rest (no matter how plausible each of them might be if you were looking at random Internet text). This reduction in entropy is a standard RL effect as agents switch from exploration to exploitation: there is no benefit to taking anything less than the single best action, so you don't want to risk taking any others. This is also why completions are so boring and Boltzmann temperature stops mattering and more complex sampling strategies like best-of-N don't work so well: the greedy logit-maximizing removes information about interesting alternative strategies, so you wind up with massive redundancy and your net 'likelihood' also no longer tells you anything about the likelihood. And note that because there is now so much LLM text on the Internet, this feeds back into future LLMs too, which will have flattened logits simply because it is now quite likely that they are predicting outputs from LLMs which had flattened logits. (Plus, of course, data labelers like Scale can fail at quality control and their labelers cheat and just dump in ChatGPT answers to make money.) So you'll observe future 'base' models which have more flattened logits too... I've wondered if to recover true base model capabilities and get logits that actually meaningful predict or encode 'dark knowledge', rather than optimize for a lowest-common-denominator rater reward, you'll have to start dumping in random Internet text samples to get the model 'out of assistant mode'. |
Of course humans employ different thinking modes too - no harm in thinking like a stone cold programmer when you are programming, as long as you don't do it all the time.