by Der_Einzige
498 days ago
You and the OP talk a lot of smack about logprobs, but we show that using them even in the simple case of dynamically truncating your cutoff point (the min_p sampler vs. static top_p/top_k) leads to extreme performance improvements (especially on small models) and unlocks very high temperature sampling (for more creativity/less slop/better synthetic data-gen): https://arxiv.org/abs/2407.01082 [1]. Indeed, ultra-high-temperature sampling in its own right should be studied. I can do top_k = 2 and temperature = system.maxint and get decent results which are extraordinarily creative (with increasing probability of token-related spelling issues as top_k goes up). I'm convinced that the model's logprobs hold so much bloody value and knowledge that I unironically do not care about how many "theoretical guarantees" they lack or about their non-correspondence to our usage of language.
[1]: Btw, this paper is now ICLR 2025 accepted and likely going to get an oral/honorable mention, since we are ranked #18 out of all submissions by scores and have an extremely favorable meta-review. Peer review seems to agree with our claims of extreme performance improvements.
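To make the top_k = 2 with extreme temperature setup concrete, here is a minimal sketch, not code from the paper. The function name `top_k_then_temperature` and the toy logits are hypothetical, `sys.maxsize` stands in for the "system.maxint" mentioned above, and the order of operations (truncate first, then apply temperature) is an assumption. Under those assumptions, the huge temperature flattens the two surviving tokens to a near coin flip, while top_k still limits the choice to the model's two highest-ranked tokens.

```python
import math
import sys


def top_k_then_temperature(logits, k, temperature):
    """Keep the k highest logits, then apply a temperature softmax to them.

    Returns a dict mapping token index -> sampling probability.
    Hypothetical helper; truncating before temperature is an assumption.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


toy_logits = [8.0, 5.0, 1.0, -2.0]  # made-up values for illustration

# At temperature 1 the top token dominates the pair.
normal = top_k_then_temperature(toy_logits, k=2, temperature=1.0)

# At an enormous temperature the two survivors become almost equally likely.
extreme = top_k_then_temperature(toy_logits, k=2, temperature=sys.maxsize)
```

In this sketch the ranking still decides which two tokens survive; the temperature only decides how evenly they are weighted.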
We may be talking about two orthogonal things here. And also to be clear, I don't care about theoretical guarantees either.
Now, min-p is solving for the inadequacies of standard sampling techniques. It is almost like a clever adaptive search which other sampling methods fail at (despite truncations like top-k/top-p).
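As a rough illustration of why min-p behaves adaptively where top-k does not, here is a minimal pure-Python sketch. It is not the paper's reference implementation: the helper names, the toy logits, and the choice to apply temperature before truncation are all assumptions made for the example. The idea shown is that min-p keeps every token whose probability is at least `min_p` times the top token's probability, so the cutoff moves with the model's confidence.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=0.1):
    """Zero out tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    """Keep the k most probable tokens regardless of their mass, then renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def sample(probs, rng=random):
    """Draw one token index from a probability list."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up logits: one confident distribution, one with four near-tied candidates.
confident = softmax([10.0, 2.0, 1.0, 0.5, 0.0])
uncertain = softmax([2.0, 1.9, 1.8, 1.7, -5.0])

min_p_confident = min_p_filter(confident, min_p=0.1)  # only the top token survives
min_p_uncertain = min_p_filter(uncertain, min_p=0.1)  # the four near-tied tokens survive
top_k_confident = top_k_filter(confident, k=2)        # always exactly two tokens
top_k_uncertain = top_k_filter(uncertain, k=2)        # always exactly two tokens

token = sample(min_p_uncertain)
```

With these toy inputs, a static top_k = 2 keeps an unlikely second token in the confident case and discards two plausible tokens in the uncertain case, whereas the min-p cutoff adjusts in both.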
However, one thing that I noticed in the min-p results was that lower temperatures were almost always better in the final performance (and, quite expectedly, the inverse for creative writing). This observation makes me think that the underlying model is generally fairly good at ranking the best tokens. What sampling gives us is a margin for error in cases where the model ranked a relevant next token not at the top, but slightly lower.
Therefore, my takeaway from min-p is that it addresses deficiencies of current samplers, but its success does not contradict the point that logprobs are bad proxies for semantics. Sampling is the simplest form of search, and I agree with you that better sampling methods are a solid ingredient for extracting information from logprobs.