by bravura 1047 days ago
Like a lot of research, unless there’s a clear explanation supported by rigorous study, they probably hill-climbed at random through a bunch of cool new one-liner changes and stopped when it was time to start writing the paper and doing ablation studies.

To be less glib, just wait until there are a bunch of papers picking SwiGLU over ReLU, and then you can stop hand-wringing. It doesn't really matter whether there was a super specific, concrete, well-articulated reason that SwiGLU worked well for their specific approach. Either way, you're still going to use ReLU by default and quickly try SwiGLU for now.
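
For anyone following along, here is a rough sketch of what "use ReLU by default and quickly try SwiGLU" amounts to in a feed-forward block. It assumes the commonly used SwiGLU form, (Swish(xW) * xV) W2, with bias terms omitted; the function names and toy weights are made up for illustration and are not from any particular paper's code.

```python
import math

def relu(x):
    # Standard ReLU: zero for negative inputs, identity otherwise.
    return max(0.0, x)

def swish(x):
    # Swish (SiLU) with beta = 1: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn_relu(x, W1, W2):
    # Baseline feed-forward block: W2 * relu(W1 * x).
    hidden = [relu(v) for v in matvec(W1, x)]
    return matvec(W2, hidden)

def ffn_swiglu(x, W, V, W2):
    # Gated variant: W2 * (swish(W * x) * (V * x)), elementwise product.
    # Note the extra projection V compared with the ReLU block.
    gate = [swish(v) for v in matvec(W, x)]
    linear = matvec(V, x)
    hidden = [g * l for g, l in zip(gate, linear)]
    return matvec(W2, hidden)

# Toy weights (hypothetical): identity projections, summing output layer.
identity = [[1.0, 0.0], [0.0, 1.0]]
sum_out = [[1.0, 1.0]]
x = [1.0, -2.0]

relu_out = ffn_relu(x, identity, sum_out)
swiglu_out = ffn_swiglu(x, identity, identity, sum_out)
```

The swap is small in code, but SwiGLU carries a third weight matrix, so a fair comparison usually shrinks the hidden width to keep the parameter count roughly equal.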

It's fine; I waited a bit before adopting ReLU over tanh as my default for all hidden layers (non-final ones, i.e. those not outputting a probability).

Thanks a lot for your explanations :)