by Hawkenfall 2404 days ago
A more in-depth paper about this found the Swish activation often outperformed other functions: https://arxiv.org/abs/1710.05941
3 comments

Most of the recent research is moving to GELU (Gaussian Error Linear Units) activation functions: https://arxiv.org/pdf/1606.08415.pdf
That's interesting. I didn't read the paper closely, but skipping to the pictures, it looks like ReLU, but smoothed out so the derivative is continuous. Intuitively, that seems useful.
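For anyone who wants to check the "smoothed-out ReLU" intuition numerically, here is a rough sketch in plain Python. The GELU form x * Phi(x) and the Swish form x * sigmoid(beta * x) are taken from the two linked papers; the helper names (`one_sided_slopes` and so on) and the step size are just illustrative choices on my part, not anything from either paper.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))


def one_sided_slopes(f, x: float = 0.0, h: float = 1e-6):
    """Finite-difference slopes just left and just right of x (hypothetical helper)."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right


if __name__ == "__main__":
    # ReLU's slope jumps at 0; GELU and Swish approach the same slope from both sides.
    for name, f in [("relu", relu), ("gelu", gelu), ("swish", swish)]:
        left, right = one_sided_slopes(f)
        print(f"{name:5s} slope left of 0: {left:.4f}, right of 0: {right:.4f}")
```

The left and right slopes differ for ReLU (the kink at zero), while for GELU and Swish they agree to within the finite-difference error, which is the continuous-derivative property visible in the plots.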
I wasn’t aware of that one. Definitely interesting, thanks for sharing!