by hashmap
55 days ago
Yeah, softmax has useful applications, but any time you find yourself using the same hammer for everything that looks like a nail, it's a bit of a red flag. If you take the instances of softmax that show up in training / inference (and there turn out to be a few) and swap in other things like entmax or sparsemax, you see across-the-board improvements. And top-1 is often just the best answer too; there's a reason temp=0 is the way to go when you're doing tool calls. Do you really want creative unicode tokens when writing bash commands? From what I can tell, most of the time softmax is the worst answer that works.
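To make the difference concrete, here's a minimal pure-Python sketch. It isn't taken from any particular library; the function names and the example logits are made up for illustration. Softmax gives every token a nonzero probability, sparsemax (a Euclidean projection onto the probability simplex) can assign exact zeros to the tail, and greedy argmax is what temp=0 sampling reduces to.

```python
import math


def softmax(logits, temperature=1.0):
    """Standard softmax; every entry ends up strictly positive."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(logits):
    """Sparsemax: project the logits onto the probability simplex.

    Entries that fall below the threshold tau get exactly zero mass.
    """
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        if 1.0 + k * z > cumsum:
            # The k largest logits are still in the support; update tau.
            tau = (cumsum - 1.0) / k
        else:
            break  # the support is a prefix of the sorted logits
    return [max(z - tau, 0.0) for z in logits]


def greedy(logits):
    """Top-1 / argmax: the limit of softmax sampling as temperature -> 0."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    # Hypothetical logits: two plausible tokens and one junk token.
    logits = [1.0, 0.8, -2.0]
    print("softmax  ", softmax(logits))    # junk token keeps a small nonzero mass
    print("sparsemax", sparsemax(logits))  # junk token gets exactly 0.0
    print("greedy   ", greedy(logits))     # index of the largest logit
```

With these logits, sparsemax puts roughly 0.6 / 0.4 on the first two tokens and exactly zero on the third, while softmax still leaves a few percent of the mass on it. That leftover tail is where the "creative unicode token in a bash command" comes from when you sample.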