|
The nonlinearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow linear NN, in effect a single linear layer (easily computable, as GP was saying), and we know a linear model can't even solve the XOR problem.
> nothing but linearity
No, if you have nonlinearities, the NN itself is not linear.
The nonlinearities are not there primarily to keep the outputs in a given range, though that's important, too. |
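To make the collapse concrete, here is a minimal sketch in plain Python showing that two stacked layers with no activation between them compute the same function as one layer with weights `W2 @ W1` and bias `W2 @ b1 + b2`. The weights and the helper names (`matmul`, `matvec`, `add`, `deep`, `shallow`) are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [p + q for p, q in zip(u, v)]

# Two layers with no activation in between (arbitrary illustrative weights).
W1, b1 = [[1, 2], [0, -1], [3, 1]], [1, 0, -2]   # maps 2 inputs to 3 hidden units
W2, b2 = [[2, 0, 1], [-1, 1, 0]], [0, 5]         # maps 3 hidden units to 2 outputs

def deep(x):
    """Apply layer 1, then layer 2, with nothing nonlinear in between."""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def shallow(x):
    """Apply the single collapsed layer."""
    return add(matvec(W, x), b)

inputs = [[0, 0], [1, 0], [0, 1], [1, 1], [3, -4]]
collapsed_ok = all(deep(x) == shallow(x) for x in inputs)
print(collapsed_ok)
```

The same argument applies to any number of layers, which is why depth alone buys nothing without a nonlinearity somewhere.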
So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial for a linear NN to solve.
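A quick sketch of that in plain Python, assuming inputs in {0, 1}. The weights (1, 1, -2) are hand-picked rather than learned, and `lift` and `linear_xor` are illustrative names.

```python
def lift(x, y):
    """Feature map (x, y) -> (x, y, xy); the product is the only nonlinear part."""
    return (x, y, x * y)

# Hand-picked weights for a purely linear readout: XOR(x, y) = x + y - 2xy.
WEIGHTS = (1, 1, -2)
BIAS = 0

def linear_xor(x, y):
    """Linear function of the lifted features."""
    return sum(w * f for w, f in zip(WEIGHTS, lift(x, y))) + BIAS

table = {(x, y): linear_xor(x, y) for x in (0, 1) for y in (0, 1)}
print(table)
```

Without the xy feature this can't work: any affine f(x, y) = ax + by + c satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), while XOR needs 0 + 0 on the left and 1 + 1 on the right.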
Architectures like Mamba have a linear recurrent state space system as their core, so even though you need a nonlinearity somewhere, it doesn't need to be pervasive. And linear recurrent networks are surprisingly powerful (https://arxiv.org/abs/2303.06349, https://arxiv.org/abs/1802.03308).
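For intuition, here is a toy scalar linear recurrence, h_t = a * h_{t-1} + b * x_t, in plain Python. This is not Mamba's actual layer (its parameters depend on the input and the state is multi-dimensional); the function names and constants are made up. It only illustrates that a linear state update can be unrolled into a weighted sum over past inputs.

```python
def run_recurrence(a, b, xs):
    """Step the linear recurrence h_t = a * h_{t-1} + b * x_t from h = 0."""
    h = 0.0
    states = []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states

def run_unrolled(a, b, xs):
    """Same states computed directly: h_t = sum over k <= t of a**(t - k) * b * x_k."""
    return [
        sum(a ** (t - k) * b * xs[k] for k in range(t + 1))
        for t in range(len(xs))
    ]

xs = [1.0, 0.0, 2.0, -1.0]
rec = run_recurrence(0.5, 2.0, xs)
unr = run_unrolled(0.5, 2.0, xs)
max_gap = max(abs(p - q) for p, q in zip(rec, unr))
print(rec, max_gap)
```

Because the update is linear, every state is a fixed linear function of the input history, so the whole sequence can be computed without stepping through time one state at a time.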