by antirez 55 days ago
> The relative differences between values get exaggerated, which means the largest logit value dominates the output, while smaller values are squashed. This is exactly what we want for confident predictions, but it also explains why softmax can be problematic when you want uncertainty estimates

Actually, I believe that most of the time, even after softmax, sampling is way too permissive and occasionally accepts low-quality candidates. We have all seen frontier LLMs sometimes insert a word in a different language, which is really off-putting and almost impossible to explain, or make some other odd error in a single word of the output. Most of the time, this is not what the model wanted to say: sampling just happened to select a low-quality token. I believe a better approach is to have a strong filter on which candidates are acceptable, like in the example here: https://antirez.com/news/142
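To make the idea concrete, here is a minimal sketch of one possible candidate filter. It is not necessarily the method from the linked post: the rule (keep only tokens whose probability is at least `min_ratio` times the top token's probability), the `min_ratio` value, and the toy logits are all assumptions chosen for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_candidates(probs, min_ratio=0.5):
    """Hypothetical filter: drop every token whose probability is below
    min_ratio * (probability of the best token), then renormalize."""
    cutoff = max(probs) * min_ratio
    kept = {i: p for i, p in enumerate(probs) if p >= cutoff}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

def sample(dist, rng):
    """Sample a token index from a {token_index: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy logits (made up): two plausible tokens and three poor ones.
logits = [5.0, 4.6, 1.0, 0.5, -2.0]
dist = filter_candidates(softmax(logits), min_ratio=0.5)
token = sample(dist, random.Random(0))
```

With plain softmax sampling the three poor tokens still get picked a small fraction of the time; after the filter they can never be sampled, while the choice between the two plausible tokens stays random.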

3 comments

Yeah, softmax may have useful applications, but anytime you find yourself using the same hammer for everything that looks like a nail it's a bit of a red flag.

If you take the instances of softmax you find in training and inference (and there turn out to be a few) and swap in alternatives like entmax or sparsemax, you see across-the-board improvements. And top-1 is often just the best answer too; there's a reason temp=0 is the way to go when you're doing tool calls. Do you really want creative unicode tokens when writing bash commands? From what I can tell, most of the time softmax is the worst answer that works.
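For anyone who hasn't looked at sparsemax, a small pure-Python sketch of the difference follows. It uses the standard sort-and-threshold formulation (sparsemax as the Euclidean projection of the logits onto the probability simplex); the example logits and the `greedy` helper standing in for temp=0 are made up for illustration, and nothing here demonstrates the quality claim above.

```python
import math

def softmax(z):
    """Dense: every entry gets strictly positive probability."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def sparsemax(z):
    """Sparse: projects z onto the probability simplex, so low-scoring
    entries receive exactly zero probability."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k  # threshold if the support has size k
        else:
            break  # the condition never holds again for larger k
    return [max(x - tau, 0.0) for x in z]

def greedy(z):
    """What temp=0 amounts to: always take the highest-scoring entry."""
    return max(range(len(z)), key=z.__getitem__)

logits = [2.0, 1.5, 0.1, -1.0]  # illustrative values only
dense = softmax(logits)
sparse = sparsemax(logits)
best = greedy(logits)
```

On these logits sparsemax puts all the mass on the first two entries and gives exactly zero to the rest, whereas softmax leaves a small nonzero tail on every entry.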

> Most of the time, this is not what the model wanted to say: sampling just happened to select a low-quality token.

How do you identify what the model wanted to say?

I would love to see more work on beam search/Viterbi decodes rather than just greedy next token output.
Regret analysis of bandit and similar algorithms shows how inference is connected to the loss function. If your loss function is good, greedy inference is as good as joint inference.

Training on cost-to-go loss is good enough. Perfect cost-to-go eliminates the need for global algorithms and allows local decision making. Given “natural” datasets it is probably the best thing to attempt to learn. The fact that probabilistic graphical models never really worked proves it somewhat.
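The cost-to-go point can be illustrated on a toy problem. The sketch below is an assumption-heavy example, not anything about real LLM training: the three-step, two-state cost table is made up, and the cost-to-go is computed exactly by backward dynamic programming rather than learned. It shows that a greedy decoder that scores each step as "step cost plus cost-to-go" recovers the globally optimal path, while a myopic greedy decoder does not.

```python
from itertools import product

# COSTS[t][i][j]: cost of moving from state i at step t to state j at step t+1.
# The numbers are invented so that the locally cheapest first move is a trap.
COSTS = [
    [[1, 4], [9, 9]],
    [[10, 10], [1, 1]],
    [[1, 2], [2, 1]],
]
N_STATES = 2

def path_cost(path, start=0):
    """Total cost of following `path` (a sequence of next states) from `start`."""
    total, state = 0, start
    for t, nxt in enumerate(path):
        total += COSTS[t][state][nxt]
        state = nxt
    return total

def cost_to_go():
    """V[t][s] = cheapest achievable cost from state s at step t to the end."""
    T = len(COSTS)
    V = [[0] * N_STATES for _ in range(T + 1)]
    for t in range(T - 1, -1, -1):
        for s in range(N_STATES):
            V[t][s] = min(COSTS[t][s][j] + V[t + 1][j] for j in range(N_STATES))
    return V

def greedy(V=None, start=0):
    """Local decisions only. Without V the decoder is myopic; with V each
    candidate is scored as immediate cost plus cost-to-go of the next state."""
    path, state = [], start
    for t in range(len(COSTS)):
        def score(j):
            future = V[t + 1][j] if V is not None else 0
            return COSTS[t][state][j] + future
        state = min(range(N_STATES), key=score)
        path.append(state)
    return path

def brute_force(start=0):
    """Global search over all paths, used here only as a reference."""
    return min(product(range(N_STATES), repeat=len(COSTS)),
               key=lambda p: path_cost(p, start))

V = cost_to_go()
myopic = greedy()          # takes the cheap first step and pays for it later
informed = greedy(V)       # same local procedure, guided by cost-to-go
optimal = brute_force()
```

Here the myopic path costs 12, while both the cost-to-go-guided greedy path and the brute-force optimum cost 6. The catch, of course, is that this only works because the cost-to-go is exact; with an approximate one, global methods like beam search can still help.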

Are there any good papers on this you would suggest/specific search terms?

I am vaguely aware of some of this, but would love to study more; I don't quite understand what this is all about. (I do see how LLMs can attend to all prior tokens, so they don't have the single point of failure that HMMs do, which is what makes Viterbi decoding more necessary there.)