by alexmolas 895 days ago
> Uh oh! We’re getting NaNs! It seems our values are too high, and when being passed to the next encoder, they end up being too high and exploding! This is called gradient explosion.

As far as I understand, this is wrong. You're not computing gradients at any point, so this isn't gradient explosion. I believe the problem is with the implementation of softmax; [0] explains how to implement a numerically stable softmax.

[0]: https://jaykmody.com/blog/stable-softmax/
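
For reference, a minimal sketch of the usual trick, assuming plain Python lists rather than the article's actual arrays (the function names are mine, not the article's): subtract the maximum before exponentiating. The result is mathematically the same, but the largest exponent becomes 0, so nothing overflows.

```python
import math

def naive_softmax(xs):
    # Exponentiates the raw values; math.exp overflows for inputs above ~709.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    # Shifting by the max leaves the ratios unchanged but keeps every
    # exponent <= 0, so each exp() result lies in (0, 1].
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    big = [1000.0, 1001.0, 1002.0]
    print(stable_softmax(big))
    try:
        naive_softmax(big)
    except OverflowError as err:
        print("naive version failed:", err)
```

In pure Python the naive version raises OverflowError. With NumPy, `np.exp` returns `inf` instead, and `inf / inf` is where the NaNs come from.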

1 comment

Yes, you're correct. I tried to connect a common training problem (gradient explosion and vanishing gradient) with the issue of softmax being sensitive to large values. I agree it's misleading/inaccurate, so will rewrite that part.

That said, the whole neural network will be sensitive to large values, so a numerically stable softmax alone won't fix it. Normalization is a key part of what makes the network work.
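
To make the normalization point concrete, here is a rough sketch of layer normalization over a single vector, assuming plain Python lists and leaving out the learned scale and shift parameters (`layer_norm` and `eps` are illustrative names, not the article's code). It rescales the activations to roughly zero mean and unit variance, so the next layer sees values of a similar size regardless of how large the inputs were.

```python
import math

def layer_norm(xs, eps=1e-5):
    # Mean and (population) variance across the features of one vector.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    # eps guards against division by zero when all values are equal.
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

if __name__ == "__main__":
    # Large activations come out on a small, fixed scale.
    print(layer_norm([1000.0, 2000.0, 3000.0]))
```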