|
|
|
|
|
by quantadev
620 days ago
|
|
ReLU technically has a non-linearity at zero, but in some sense it's still even MORE linear than tanh or sigmoid, so it just demonstrates even better than tanh-type squashing that all this LLM stuff is being done ultimately with straight line math. All a ReLU function does is choose which line to use, a sloped one or a zero one. |
|
I’ll just reiterate that the single “technical” (whatever that means) nonlinearity in ReLU is exactly what lets a layer approximate any continuous[*] function.
[*] May have forgotten some more adjectives here needed for full precision.