Hacker News
by newhouseb 1114 days ago
Oh, good catch! The author defines a layer norm layer but then comments it out in the actual implementation (I missed that it was commented out). So that answers my second question about what happens without it.

Anecdotally, in my own interpretability work (without layer norm), my models also learn rotations fairly frequently. I attributed this to the way I was doing positional embeddings (as rotations), but perhaps there's more to it.
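
For anyone unfamiliar with positional embeddings as rotations, here is a rough sketch of the general idea. It is not the exact scheme from my models: `rotate_pair`, `dot`, and the `theta` step size are hypothetical names and values chosen for illustration. Each 2D pair of features is rotated by an angle proportional to its position, so the dot product between two rotated vectors depends only on how far apart the positions are.

```python
import math

def rotate_pair(x, y, position, theta=0.1):
    # Rotate the 2D vector (x, y) by an angle proportional to the position.
    # theta is an assumed step size, not a value from any specific model.
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    # Plain 2D dot product.
    return a[0] * b[0] + a[1] * b[1]

q = (1.0, 2.0)
k = (0.5, -1.0)

# Same relative offset (4 positions) at two different absolute positions.
score_a = dot(rotate_pair(*q, 3), rotate_pair(*k, 7))
score_b = dot(rotate_pair(*q, 13), rotate_pair(*k, 17))

# Rotation also leaves the vector's length unchanged.
rotated_q = rotate_pair(*q, 5)
```

Here `score_a` and `score_b` agree up to floating-point error, since only the position offset matters.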


Thinking about this more, softmax is also a form of normalization, so it could plausibly contribute to this phenomenon too.
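
To make "softmax is a form of normalization" concrete, here is a minimal pure-Python sketch (the `softmax` helper and the example logits are mine, for illustration only). The outputs always sum to 1, and adding a constant to every input leaves them unchanged, so softmax discards the overall offset of its inputs. Whether that actually drives the rotation behavior is speculation on my part.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; this does not change the result.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]
probs = softmax(logits)

# Shifting every logit by the same constant gives the same distribution.
shifted = softmax([x + 10.0 for x in logits])
```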