| HN Mirror



	by nothing0001 1219 days ago
	I wonder what happens when one changes the activation function. Are there any related results in that direction?

1 comments

wcoenen 1219 days ago

ReLU is the usual default now, because it generally performs better than the alternatives.


rsfern 1219 days ago

For what it’s worth, I usually default to swish activations, which seem to be popular in my corner of graph neural nets (materials and chemistry). Performance is about the same as ReLU, and I like swish because it doesn’t have ReLU’s kink at zero (ReLU itself is continuous, but its derivative jumps from 0 to 1 there).
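To make the smoothness point concrete, here is a minimal sketch in plain Python. It assumes the common definition swish(x) = x * sigmoid(beta * x) with beta = 1; the helper names are made up for illustration and do not come from any particular library.

```python
import math

def sigmoid(x: float) -> float:
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def relu(x: float) -> float:
    return max(0.0, x)

def swish(x: float, beta: float = 1.0) -> float:
    # Assumed definition: x * sigmoid(beta * x).
    return x * sigmoid(beta * x)

def left_slope(f, x: float, h: float = 1e-6) -> float:
    # Backward finite difference: slope just to the left of x.
    return (f(x) - f(x - h)) / h

def right_slope(f, x: float, h: float = 1e-6) -> float:
    # Forward finite difference: slope just to the right of x.
    return (f(x + h) - f(x)) / h

for name, f in (("relu", relu), ("swish", swish)):
    print(name, left_slope(f, 0.0), right_slope(f, 0.0))
```

For ReLU the two one-sided slopes at zero disagree (0 on the left, 1 on the right), while for swish they agree at roughly 0.5.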


malux85 1219 days ago

Forgive my naive question, but what if the neural network’s output range is -1 to +1? If the activation functions are ReLU, doesn’t that mean a negative value cannot be produced?


nothing0001 1219 days ago

A ReLU unit itself cannot produce a negative value, but in the hidden layers the output of each neuron is multiplied by weights, so the sign can be encoded in those weights.
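A toy illustration of this, as a sketch only: two ReLU hidden units followed by a linear output layer with one positive and one negative weight. The weights are hand-picked for the example (not learned), and it assumes the final layer has no ReLU of its own.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def tiny_net(x: float) -> float:
    # Hidden layer: both activations are non-negative.
    h1 = relu(1.0 * x)    # active for positive inputs
    h2 = relu(-1.0 * x)   # active for negative inputs
    # Linear output layer: the negative weight on h2 restores the sign.
    return 1.0 * h1 + (-1.0) * h2

for x in (-2.0, 0.0, 3.0):
    print(x, tiny_net(x))
```

With these weights the network computes relu(x) - relu(-x), which equals x, so negative outputs come through even though every hidden activation is non-negative.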


hansvm 1219 days ago

The strategy for most of these things is that most of the network builds up a bunch of "shapes", and you have a final layer that projects those appropriately into the output space. The intermediate layers can use basically any activation they want that has desirable convergence properties, and at the end you might have a linear projection (or full MLP layer) followed by a sigmoid or other reshaping. The GPT family uses "softmax" -- an exponentially weighted normalizing function that scales all the outputs to sum to 1 (since they represent probabilities for each candidate next token).
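For anyone who hasn't seen it written out, a minimal softmax sketch in plain Python. Subtracting the maximum logit first is a standard numerical-stability trick and does not change the result; the example logits are arbitrary.

```python
import math

def softmax(logits):
    # Shift by the max so math.exp never sees a large positive argument.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    # Each output is positive and the outputs sum to 1.
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```

Larger logits get exponentially more of the probability mass, and the ordering of the inputs is preserved.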


rsfern 1219 days ago

If your output range is bounded like that, then you probably want a sigmoid or tanh activation (with some shifting and scaling, maybe) on the output layer. But the hidden layers can still use ReLU without issue.
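A rough sketch of the "shifting and scaling" part, in plain Python. The function name `squash_to_range` is hypothetical; it maps an unbounded pre-activation z into [lo, hi] with a sigmoid. For the [-1, 1] case this is the same thing as tanh(z / 2).

```python
import math

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def squash_to_range(z: float, lo: float, hi: float) -> float:
    # Hypothetical helper: scale sigmoid's (0, 1) output to (lo, hi).
    return lo + (hi - lo) * sigmoid(z)

for z in (-5.0, 0.0, 5.0):
    # For lo=-1, hi=1 the result matches tanh(z / 2).
    print(z, squash_to_range(z, -1.0, 1.0), math.tanh(z / 2.0))
```

In a real network z would be the output of the last linear layer; the hidden layers before it are free to use ReLU.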


boredumb 1219 days ago

A 'leaky ReLU' is used to retain some of the negative stuff and prevent neurons from dying young. I just googled around and now see "PReLU", which also seems to address negative values.
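Roughly, in plain Python (a sketch; 0.01 is a commonly used slope but is just an assumed default here). PReLU has the same functional form, except that the negative-side slope is a learned parameter rather than a fixed constant.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    # Passes positive inputs through; scales negative inputs by a small slope.
    return x if x > 0 else negative_slope * x

def relu_grad(x: float) -> float:
    # Zero gradient for negative inputs: a unit stuck there stops learning.
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    # Small but nonzero gradient for negative inputs keeps the unit trainable.
    return 1.0 if x > 0 else negative_slope

for x in (-2.0, 3.0):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
```

The "dying" problem is the zero gradient on the negative side of plain ReLU; the leaky variants replace it with a small nonzero one.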


p1esk 1218 days ago

Almost no one uses ReLU anymore. It's usually GELU for LLMs, and Swish for images.
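For reference, a small sketch of the exact GELU, x * Phi(x), where Phi is the standard normal CDF, next to ReLU. This uses only `math.erf` from the standard library; many frameworks also offer a tanh-based approximation, which is not shown here.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, relu(x), gelu(x))
```

Away from zero the two are close; the difference is the smooth transition around zero and the small negative outputs GELU produces for moderately negative inputs.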


bilsbie 1219 days ago

Why is it good to ignore values less than zero?


wcoenen 1219 days ago

Intuitively it makes sense that certain nodes only "activate" when the concept/feature that they have learned is actually present in the input.

(Whether this is the correct interpretation, I'm not sure.)


hansvm 1219 days ago

ELU is pretty common, no?
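In case the name is unfamiliar, a quick sketch of the usual ELU definition in plain Python, assuming alpha = 1.0 as the default: identity for positive inputs, and alpha * (exp(x) - 1) for the rest, which saturates smoothly at -alpha.

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    # Identity for x > 0; smooth exponential curve toward -alpha otherwise.
    # math.expm1(x) computes exp(x) - 1 accurately for small x.
    return x if x > 0 else alpha * math.expm1(x)

for x in (-1000.0, -1.0, 0.0, 2.0):
    print(x, elu(x))
```

Unlike ReLU it produces negative outputs, and unlike leaky ReLU those outputs are bounded below.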
