by Grosvenor
613 days ago
I guess the next step is to see whether you're getting those mega activations as he describes. A/B test the two models and compare? It would be interesting to see whether these activations only show up in larger models, or whether they scale with model size in some way.
|
Hah. Yes. It looks like they only show up in models with 6.7B parameters or more.
The problem can start at 125M parameters, which is small enough to test on a whim.
So train a model that exhibits these behaviours, then try it out.
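For what it's worth, the A/B comparison suggested above could look roughly like this. It's only a sketch: `outlier_stats` and `ratio_threshold` are made-up names, the threshold of 100 is an arbitrary assumption rather than a number from the article, and the activations here are synthetic random data standing in for whatever you'd actually capture from the two models (e.g. with forward hooks).

```python
import numpy as np

def outlier_stats(acts, ratio_threshold=100.0):
    """Flag hidden dimensions whose peak magnitude dwarfs the typical one.

    acts: array of shape (tokens, hidden_dim) holding one layer's activations.
    ratio_threshold: hypothetical cutoff, chosen arbitrarily for illustration.
    """
    mags = np.abs(acts)
    per_dim_max = mags.max(axis=0)   # largest magnitude seen in each dimension
    typical = np.median(mags)        # typical magnitude across the whole layer
    ratios = per_dim_max / typical
    outlier_dims = np.flatnonzero(ratios > ratio_threshold)
    return {
        "max_ratio": float(ratios.max()),
        "outlier_dims": outlier_dims.tolist(),
    }

# Synthetic stand-ins for the two models' activations (NOT real model data).
rng = np.random.default_rng(0)
baseline = rng.normal(size=(512, 64))
candidate = rng.normal(size=(512, 64))
baseline[:, 7] *= 500.0  # inject one artificially huge dimension into model A

stats_a = outlier_stats(baseline)
stats_b = outlier_stats(candidate)
print("A:", stats_a)
print("B:", stats_b)
```

With real models you'd run the same prompts through both, grab the same layer from each, and compare the two result dicts. Max-over-median is just one easy way to make "mega" measurable; other statistics (kurtosis, a fixed absolute cutoff) would work too.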