by quickthrower2
1056 days ago
That simple comment is a strong counterpoint to the entire blog post? Except with the +1 in the denominator, the model might train all of the inputs to become very negative so that softmax outputs values close to zero, whereas it wouldn't bother before because making one probability bigger makes another smaller.
It still can't do this because of L2 regularization / weight decay. If two vectors have norm 1, their inner product is at least -1, so with 2000 vectors the sum of exponentials in the denominator is still at least 2000 * e^(-1) =~ 736.
Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.
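
For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes the "+1 denominator" variant is exp(x_i) / (1 + sum_j exp(x_j)) and that every logit is an inner product of unit-norm vectors, so no logit goes below -1. The function names are made up for illustration, and real attention layers also scale and project things, so treat this as a toy bound only.

```python
import math


def min_exp_sum(n_keys: int, min_logit: float = -1.0) -> float:
    """Lower bound on sum_i exp(logit_i) when every logit is >= min_logit."""
    return n_keys * math.exp(min_logit)


def softmax_plus_one_mass(logits: list[float]) -> float:
    """Total probability mass of the assumed +1 variant: S / (1 + S), S = sum exp(x)."""
    s = sum(math.exp(x) for x in logits)
    return s / (1.0 + s)


n = 2000
# Assumed worst case: every one of the 2000 logits sits at the floor of -1.
bound = min_exp_sum(n)
mass = softmax_plus_one_mass([-1.0] * n)

print(f"sum of exponentials >= {bound:.2f}")
print(f"total attention mass >= {mass:.4f}")
```

Under these assumptions the extra 1 in the denominator is tiny next to the sum, so the outputs still add up to almost 1 and the layer cannot get close to "attend to nothing" without much larger norms.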