Hacker News new | ask | show | jobs
by quickthrower2 1056 days ago
That simple comment is a strong counterpoint to the entire blog post?

Except with the +1 denominator, it might be that the model trains all of the inputs to become very negative so softmax chucks out close to zeros, whereas it wouldn't bother before because making one prob bigger makes another smaller.

1 comments

> it might be that the model trains all of the inputs to become very negative

It still can't do this because of L2 regularization / weight decay. If two vectors are norm 1, their inner product is at least -1, so with 2000 vectors that's still 2000 * e^(-1) =~ 735.

Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.

I guess you could add a sort of gating operation with a learnable parameter that sends the value to -inf if doesn't reach the threshold.

Of course it might have some other serious repercussions.