by jorlow 747 days ago
Note that Llama's feed-forward block is a bit different too:

  self.w2(F.silu(self.w1(x)) * self.w3(x))
I.e. the nonlinearity is a gate.

https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...

1 comments

FWIW, that's SwiGLU in #3 above. Swi = Swish = SiLU (`silu` in PyTorch). GLU is "gated linear unit", i.e. the gate construction you describe.
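
In case it helps to see the pieces spelled out, here is a rough dependency-free sketch of that expression in plain Python. It is not the Llama code: the helper names (`silu`, `matvec`, `swiglu_ffn`) and the tiny weight matrices are made up for illustration, and it assumes bias-free linear layers stored as lists of rows.

```python
import math

def silu(v):
    # SiLU / Swish: v * sigmoid(v), with sigmoid written via tanh
    # so it does not overflow for large negative v.
    return v * 0.5 * (1.0 + math.tanh(0.5 * v))

def matvec(w, x):
    # Bias-free linear layer: w is a list of rows, x is a vector.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w1, w2, w3):
    # Mirrors: w2(silu(w1(x)) * w3(x))
    gate = [silu(h) for h in matvec(w1, x)]       # nonlinearity applied to the w1 branch
    up = matvec(w3, x)                            # plain linear w3 branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(w2, hidden)                     # project back down

# Hypothetical toy weights: model dim 2, hidden dim 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w3 = [[0.5, 0.0], [0.0, 0.5], [1.0, -1.0]]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
out = swiglu_ffn([1.0, 2.0], w1, w2, w3)
```

The point is just that the activation is not applied to the whole hidden vector; it is applied to one projection (`w1`), which then multiplies a second projection (`w3`) elementwise.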