| Hello! I'm one of the main authors of the paper. Thanks for engaging with our work so thoughtfully – that's a very clear and valid question. We didn't get around to addressing this within the paper itself – 80 pages is a lot, and deadlines, etc. But I have unpublished experiments that show that in a reasonably broad setting I'm doing some work in, verbalized probabilities are restoring a distribution that looks almost identical to the base distribution. It is not possible to demonstrate this on frontier models, since their public models are already mode-collapsed, and they don't share the base model or logprobs anyway. But I've established this to my personal satisfaction on large local models which offer base / post-trained pairs. To share some intuition on why one might believe this is occurring: there are a bunch of tasks implicit in the pre-training corpus that encourage the model to learn this capability. Consider sentences in news and research articles like: "Scientists discover that [doing something] increases [some outcome] on [some population] by X%". It seems quite natural that the model might learn a pathway by which it can translate its base probabilities into the equivalent numeric tokens in order to "beat" the task of reducing loss on the "X%" prediction. I can even almost visualize how this works mechanically in terms of what the upper layers of an MLP would do to learn this, i.e. translating from weights into specific token slots. And this is almost certainly more parameter-efficient than constructing an entire separate emulated reality for filling in X. Although I'm not ruling out that the latter might still be happening – perhaps some future interp research might be able to validate this! I'm actually working on a paper that packs up some of the above findings in passing. But if helpful in the meantime, this is also building on related work by Tian et al. 2023, "Just Ask for Calibration" [1] and Meister et al. 2024, "Benchmarking Distributional Alignment of LLMs" [2], that give some extra confidence here. Their findings indicate that whether or not they are rooted in the model's base probabilities, they seem to be useful for the purposes that people care about. (Oh, and you can probably set up an experiment to verify this independently with vLLM in a few Claude Code requests!) Hope that was helpful – feel free to ping with follow-ups! (Although replies might be a little delayed, I happened to see this at a good time; having quite a crunchy week) [1] https://arxiv.org/abs/2305.14975 [2] https://arxiv.org/abs/2411.05403 |