Hacker News new | ask | show | jobs
by visarga 1292 days ago
> matrix multiplication to Relus (just a max(x,0))

Transformers famously employ the Softmax activation inside the attention matrix. Very rare to see Softmax anywhere other than the final layer.