Hacker News
by osanseviero 891 days ago
Yes, you're correct. I tried to connect a common training problem (exploding and vanishing gradients) with the issue of softmax being sensitive to large values. I agree it's misleading/inaccurate, so I will rewrite that part.

That said, the whole neural network will be sensitive to large values, so that won't be fixed by a numerically stable softmax. Normalization is a key part of making the network work.
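
For anyone following along, here is a minimal sketch of what "numerically stable softmax" usually refers to: subtracting the maximum input before exponentiating. The function names `naive_softmax` and `stable_softmax` are my own, purely illustrative, and not taken from the article. The trick only avoids overflow inside softmax itself; it does nothing about large values elsewhere in the network.

```python
import math

def naive_softmax(xs):
    # Direct definition: exp(x_i) / sum_j exp(x_j).
    # math.exp raises OverflowError for large inputs (e.g. 1000.0).
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    # Softmax is unchanged if the same constant is subtracted from every input,
    # so shift by the max: the largest exponent becomes 0 and exp cannot overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    logits = [1000.0, 1001.0, 1002.0]
    print(stable_softmax(logits))
    try:
        naive_softmax(logits)
    except OverflowError:
        print("naive softmax overflowed")
```

Both versions agree on small inputs; they only differ once the exponentials stop fitting in a float.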