| HN Mirror

by wrs 665 days ago

With a ReLU activation function, rather than a simple linear function of the inputs, you get a piecewise linear approximation of a nonlinear function.

ReLU enables this by being nonlinear in a simple way, specifically by outputting zero for negative inputs, so each linear unit can then limit its contribution to a portion of the output curve.

(This is a lot easier to see on a whiteboard!)
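As a rough stand-in for the whiteboard, here is a minimal sketch of the idea in NumPy. It is not how a trained network finds its weights; the helper names (`fit_relu_interpolant`, `evaluate`) and the choice of `sin` as the target are assumptions made for illustration. Each ReLU unit switches on at one knot and contributes only the change in slope needed from that point onward, so the sum is a piecewise linear curve through the target's values at the knots.

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def fit_relu_interpolant(f, knots):
    """Hypothetical helper: choose a bias and output weights so that
    bias + sum_i coeffs[i] * relu(x - knots[i]) interpolates f at the knots."""
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope of each segment
    # Unit i turns on at knots[i] and adds the change in slope at that knot.
    coeffs = np.diff(slopes, prepend=0.0)
    return values[0], coeffs

def evaluate(x, knots, bias, coeffs):
    # One hidden ReLU unit per segment, combined by a linear output layer.
    hidden = relu(x[:, None] - knots[None, :-1])
    return bias + hidden @ coeffs

knots = np.linspace(0.0, 2.0 * np.pi, 33)
bias, coeffs = fit_relu_interpolant(np.sin, knots)

xs = np.linspace(0.0, 2.0 * np.pi, 1000)
max_error = np.max(np.abs(evaluate(xs, knots, bias, coeffs) - np.sin(xs)))
print("max error with", len(coeffs), "ReLU units:", max_error)
```

Adding more knots adds more units and shrinks the error, which is the sense in which the nonlinearity lets each unit limit its contribution to a portion of the curve.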

1 comments

quantadev 665 days ago

ReLU technically has a non-linearity at zero, but in some sense it's still even MORE linear than tanh or sigmoid, so it just demonstrates even better than tanh-type squashing that all this LLM stuff is being done ultimately with straight line math. All a ReLU function does is choose which line to use, a sloped one or a zero one.

wrs 665 days ago

Well. The word “linear” the way you use it doesn’t seem to have any particular meaning, certainly not the standard mathematical meaning, so I’m not sure we can make further progress on this explanation.

I’ll just reiterate that the single “technical” (whatever that means) nonlinearity in ReLU is exactly what lets a layer approximate any continuous[*] function.

[*] May have forgotten some more adjectives here needed for full precision.

quantadev 665 days ago

If you're confused, just show a tanh graph and a ReLU graph to a 7-year-old child and ask which one is linear. They'll all get it right. So you're not confused in the slightest bit about anything I've said. There's nothing even slightly confusing about saying a ReLU is made of two lines.

wrs 664 days ago

Well, 7-year-olds don’t know a lot of math, typically, so I wouldn’t ask one that question. “Linear” has a very precise mathematical definition, which is not “made of some straight lines”, that when used properly enables entire fields of endeavor.

It would be less confusing if you chose a different word, or at least defined the ones you’re using. In fact, if you tried to precisely express what you mean by saying something is “more linear”, that might be a really interesting exploration.
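For reference, the standard definition can be checked numerically. A function f is linear when f(x + y) = f(x) + f(y) and f(a·x) = a·f(x) for all inputs. The sketch below is only illustrative: the helper names `is_additive` and `is_homogeneous` are made up here, and spot-checking a few points can disprove linearity but never prove it.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def double(x):
    # A genuinely linear map, used for comparison.
    return 2.0 * x

def is_additive(f, x, y):
    # Checks f(x + y) == f(x) + f(y) at a single pair of points.
    return bool(np.isclose(f(x + y), f(x) + f(y)))

def is_homogeneous(f, a, x):
    # Checks f(a * x) == a * f(x) at a single point.
    return bool(np.isclose(f(a * x), a * f(x)))

# Both properties hold for the linear map at these points.
print(is_additive(double, 1.0, -1.0), is_homogeneous(double, -1.0, 1.0))

# ReLU fails both as soon as the inputs straddle zero:
# relu(1 + (-1)) = 0, but relu(1) + relu(-1) = 1.
print(is_additive(relu, 1.0, -1.0), is_homogeneous(relu, -1.0, 1.0))

# Restricted to positive inputs, ReLU behaves like the identity.
print(is_additive(relu, 1.0, 2.0))
```

The last check is probably the closest precise reading of "more linear": ReLU agrees with a linear map on each side of zero, while tanh and sigmoid do not agree with one on any interval.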

quantadev 663 days ago

It's perfectly legitimate to discuss the linear aspects of piecewise linear functions. I've heard Andrej Karpathy do it in precisely the same way I did on this thread, talking about ReLU.

We just have a lot of very pedantic types on HN who intentionally misinterpret other people's posts in order to have something to "disprove".

mickg10 665 days ago

I.e. ReLU is _piecewise_ linear. The kink at zero that separates the two pieces (a discontinuity in the slope, not in the function itself) is precisely what makes it nonlinear, which is what enables the actual universal approximation.

quantadev 665 days ago

Which is what I said two replies ago.

Followed by "in some sense it's [ReLU] still even MORE linear than tanh or sigmoid functions are". There's no way you misunderstood that sentence, or took it as my "definition" of linearity...so I guess you just wanted to reaffirm I was correct, again, so thanks.
