by newhouseb
1114 days ago
Oh, good catch! The author defines a layer norm layer but then... comments it out in the actual implementation (I missed that it was commented out). So that answers my second question about what happens without it. Anecdotally, in my own interpretability work (without layer norm), my models also learn rotations fairly frequently. I attributed this to the way I was doing positional embeddings (as rotations), but perhaps there's more to it.