by uh_uh
622 days ago
> The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1.

That's not the main point, even though it probably helps. As OkayPhysicist said above, without a nonlinearity you could collapse all the weight matrices into a single matrix. If you have 2 layers (same size, for simplicity) described by weight matrices A and B, you could multiply them to get a single matrix C and use that for inference instead. The same trick works not only with 2 layers but with 100 million: they all collapse into a single matrix after multiplication. If the nonlinearities weren't there, the effective information content of the whole NN would collapse into that of a single-layer NN.
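
A quick sketch of the collapse in plain Python, in case it helps. The 2x2 weights, the input vector, and the helper names are all made up for illustration, and I'm assuming column vectors, so applying layer A and then layer B corresponds to the product C = B·A.

```python
import math

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def matvec(M, v):
    """Apply matrix M to vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical weights for two same-size layers, plus a hypothetical input.
A = [[1.0, 2.0], [0.0, 1.0]]   # first layer
B = [[2.0, 0.0], [1.0, 1.0]]   # second layer
x = [1.0, -1.0]

# Purely linear network: B(Ax) equals (BA)x, so one matrix C does the same job.
C = matmul(B, A)
two_layers = matvec(B, matvec(A, x))
collapsed = matvec(C, x)
linear_gap = max(abs(p - q) for p, q in zip(two_layers, collapsed))

# Put tanh between the layers and C no longer reproduces the output.
with_tanh = matvec(B, [math.tanh(h) for h in matvec(A, x)])
tanh_gap = max(abs(p - q) for p, q in zip(with_tanh, collapsed))

print("linear gap:", linear_gap)
print("tanh gap:", tanh_gap)
```

The linear gap should come out as zero (up to float rounding), while the tanh gap is clearly nonzero. Stacking more linear layers just means folding more matrices into C the same way.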