by antirez
55 days ago
> The relative differences between values get exaggerated, which means the largest logit value dominates the output, while smaller values are squashed. This is exactly what we want for confident predictions, but it also explains why softmax can be problematic when you want uncertainty estimates.

Actually, I believe that most of the time, even after softmax, sampling is way too permissive and every so often accepts low-quality candidates. We have all seen frontier LLMs occasionally drop in a word from a different language, which is really off-putting and almost impossible to explain, or make some other odd error in just a single word of the output. Most of the time this is not what the model wanted to say: the sampler just happened to select a low-quality token. I believe a better approach is to have a strong filter on which candidates are acceptable, like in the example here: https://antirez.com/news/142
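
To make the "strong filter" idea concrete, here is a minimal sketch in Python. It is not necessarily the method from the linked post: it assumes a simple rule where a token stays a candidate only if its probability is at least some fraction of the top token's probability. The names `filter_candidates` and `min_ratio`, and the example logits, are made up for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_candidates(probs, min_ratio=0.5):
    """Zero out tokens whose probability is below min_ratio * top probability,
    then renormalize. min_ratio is a hypothetical knob, not a standard name."""
    cutoff = max(probs) * min_ratio
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token survives
    return [p / total for p in kept]

def sample(probs, rng=random):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Two plausible tokens and three long-tail ones.
logits = [5.0, 4.6, 1.0, 0.5, -2.0]
probs = softmax(logits)
filtered = filter_candidates(probs, min_ratio=0.5)

print([round(p, 3) for p in probs])     # the tail tokens still have nonzero mass
print([round(p, 3) for p in filtered])  # only the first two tokens can be sampled
```

With plain softmax the tail tokens are rare but still possible on every draw, so over a long generation one of them eventually gets picked. After the filter they cannot be sampled at all, while the choice between the two plausible tokens stays random.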
If you take the instances of softmax that show up in training / inference (and there turn out to be a few) and swap in alternatives like entmax or sparsemax, you see across-the-board improvements. And top-1 is often just the best answer too: there's a reason temp=0 is the way to go when you're doing tool calls. Do you really want creative Unicode tokens when writing bash commands? From what I can tell, most of the time softmax is the worst answer that works.
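
For anyone who hasn't seen sparsemax, here is a rough pure-Python sketch of the usual sort-and-threshold formulation (Euclidean projection of the logits onto the probability simplex), next to softmax and a greedy top-1 pick. The example logits and the `greedy` helper name are assumptions for illustration; this is not a claim about how any particular library implements it.

```python
import math

def softmax(logits):
    """Numerically stable softmax: every token gets nonzero probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(logits):
    """Project logits onto the probability simplex.
    Tokens far enough below the top ones get exactly zero probability."""
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for i, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The support is the longest sorted prefix satisfying this condition.
        if 1.0 + i * z > cumsum:
            tau = (cumsum - 1.0) / i
    return [max(z - tau, 0.0) for z in logits]

def greedy(logits):
    """What temp=0 amounts to in practice: always take the top-1 token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.1]
print(softmax(logits))    # all three entries are positive
print(sparsemax(logits))  # [1.0, 0.0, 0.0]
print(greedy(logits))     # 0
```

When the top logits are close together, sparsemax keeps several of them (for example `[1.0, 0.8, 0.1]` gives roughly `[0.6, 0.4, 0.0]`), so it is not the same as always taking top-1. It just refuses to spend probability on the tail.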