by nothing0001 1219 days ago
I wonder what happens when one changes the activation function. Are there any related results in that direction?

ReLU is always used now, because it performs better than the alternatives.
For what it’s worth, I usually default to swish activations, which seem to be popular in my corner of graph neural nets (materials and chemistry). Performance is about the same as ReLU, and I like swish because it doesn’t have a hard kink at zero (ReLU itself is continuous, but its derivative jumps from 0 to 1 there).
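To make the smoothness point concrete, here is a minimal sketch in plain Python. It assumes the common definition swish(x) = x * sigmoid(x); the `slope` helper is a hypothetical finite-difference estimate added only for illustration.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def swish(x: float) -> float:
    # x * sigmoid(x), written as x / (1 + e^(-x))
    return x / (1.0 + math.exp(-x))

def slope(f, x: float, h: float = 1e-6) -> float:
    # Central finite-difference estimate of f'(x) (illustrative helper)
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Compare slopes just left and right of zero
for x in (-0.01, 0.01):
    print(x, slope(relu, x), slope(swish, x))
```

The ReLU slope jumps from 0 to 1 across zero, while the swish slope changes only slightly (it is close to 0.5 on both sides).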
Forgive my naive question, but what if the neural network’s output range is -1 to +1? If the activation functions are ReLU, doesn’t that mean a negative value cannot be produced?
A ReLU unit itself cannot produce a negative value, but in the hidden layers the output of one neuron is multiplied by weights, so the sign can be encoded in those weights.
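A tiny sketch of that idea, with hand-picked (hypothetical) weights rather than trained ones: two ReLU hidden units feed a linear output, and a negative output weight is what lets the network emit negative values.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def tiny_net(x: float) -> float:
    # Two hidden ReLU units; the weights are made up for illustration
    h1 = relu(1.0 * x)    # non-zero only for positive x
    h2 = relu(-1.0 * x)   # non-zero only for negative x
    # Linear output layer (no activation): the negative weight on h2
    # turns a non-negative hidden value into a negative output
    return 1.0 * h1 + (-1.0) * h2

print(tiny_net(2.0))
print(tiny_net(-3.0))
```

With these weights the network reproduces its input, including negative inputs, even though every hidden activation is non-negative.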
The strategy for most of these things is that most of the network builds up a bunch of "shapes", and you have a final layer that projects those appropriately into the output space. The intermediate layers can use basically any activation they want that has desirable convergence properties, and at the end you might have a linear projection (or full MLP layer) followed by a sigmoid or other reshaping. The GPT family uses "softmax" -- an exponentially weighted normalizing function that makes all the outputs positive and scales them to sum to 1 (since they represent probabilities for each candidate next token).
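For reference, softmax can be sketched in a few lines of plain Python. The max-subtraction step is a standard numerical-stability trick and an addition here, not something the comment above specifies; the input values are arbitrary.

```python
import math

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits may be negative; the outputs are all positive and sum to 1
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```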
If your output range is bounded like that, then you probably want a sigmoid or tanh activation (with some shifting and scaling maybe) on the output layer. But the hidden layers can still use ReLU without issue.
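A rough sketch of that setup, assuming a single ReLU hidden unit and made-up weights: the hidden layer uses ReLU, and only the output is squashed with tanh, then shifted and scaled into a target range. The names `bounded_output`, `lo` and `hi` are hypothetical.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def bounded_output(z: float, lo: float = -1.0, hi: float = 1.0) -> float:
    # tanh maps to (-1, 1); shift and scale that into (lo, hi)
    return lo + (hi - lo) * (math.tanh(z) + 1.0) / 2.0

def forward(x: float) -> float:
    h = relu(0.5 * x + 0.1)   # hidden layer: ReLU, never negative
    z = -2.0 * h + 0.3        # linear pre-activation, can be negative
    return bounded_output(z)  # output layer: squashed into [-1, 1]

for x in (-5.0, 0.0, 5.0):
    print(x, forward(x))
```

With the default `lo` and `hi` this is just tanh; other bounds, such as 0 to 10, only change the shift and scale.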
A 'leaky ReLU' is used to retain some of the negative stuff and prevent neurons from dying young. I just googled around and now see "PReLU", which seems to also address negative values.
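In code, the difference from plain ReLU is small. This is a minimal sketch; the default slope of 0.01 is a commonly used value chosen here as an assumption. PReLU has the same form, except that the negative slope is a learned parameter instead of a fixed constant.

```python
def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    # Pass positive inputs through; scale negative inputs instead of zeroing them
    return x if x >= 0 else negative_slope * x

def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    # The gradient is never exactly zero, so a unit stuck on the negative
    # side still receives a (small) training signal
    return 1.0 if x >= 0 else negative_slope

print(leaky_relu(3.0), leaky_relu(-2.0))
print(leaky_relu_grad(3.0), leaky_relu_grad(-2.0))
```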
Almost no one uses ReLU anymore. It's usually GELU for LLMs, and Swish for images.
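For anyone unfamiliar with GELU, here is a short sketch of the exact form, x times the standard normal CDF. Whether a given model uses this form or a tanh-based approximation varies, so treat the choice here as an assumption.

```python
import math

def gelu(x: float) -> float:
    # x * Phi(x), where Phi is the standard normal CDF written via erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Small negative inputs give small negative outputs instead of a hard zero
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu(x))
```

Like swish, it is smooth around zero and approaches ReLU for inputs far from zero.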
Why is it good to ignore values less than zero?
Intuitively it makes sense that certain nodes only "activate" when the concept/feature that they have learned is actually present in the input.

(Whether this is the correct interpretation, I'm not sure.)

ELU is pretty common, no?