by impossiblefork
613 days ago
I tried that in a small transformer that I trained from scratch, and it didn't really make any difference. I also made a version where I made this trainable somehow, probably by replacing the 1 with a constant associated with the layer, and that didn't make any difference either. I didn't follow Miller's proposal quite as he wrote it, though: I put the mechanism in all the layers rather than avoiding it at the end. My test doesn't absolutely rule out usefulness, since there are always different ways of applying something, but I saw no indication of it.
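For reference, a minimal sketch of the mechanism being discussed, assuming it is the softmax variant with an extra 1 in the denominator, and assuming the "trainable" version simply swaps that 1 for a constant `c`. The function name `softmax_plus_c` and the parameter `c` are hypothetical, and this is not the code from the experiment described above.

```python
import math

def softmax_plus_c(scores, c=1.0):
    """Return exp(x_i) / (c + sum_j exp(x_j)) for each x_i in scores.

    c = 0 gives the standard softmax; c = 1 gives the assumed
    "extra 1 in the denominator" variant. In a trainable version,
    c would be a per-layer parameter (an assumption, not shown here).
    """
    if c < 0:
        raise ValueError("c must be non-negative")
    # Shift by the largest exponent for numerical stability. The constant
    # term behaves like an extra score of 0, so include 0 in the max when c > 0.
    m = max(scores)
    if c > 0:
        m = max(m, 0.0)
    exps = [math.exp(x - m) for x in scores]
    # c * exp(0 - m): the constant term after applying the same shift.
    extra = c * math.exp(-m) if c > 0 else 0.0
    denom = extra + sum(exps)
    return [e / denom for e in exps]

standard = softmax_plus_c([1.0, 2.0, 3.0], c=0.0)  # weights sum to 1
variant = softmax_plus_c([1.0, 2.0, 3.0], c=1.0)   # weights sum to less than 1
```

With `c > 0` the weights no longer have to sum to one, so when every score is very negative they can all shrink toward zero.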
A/B test the two models and compare?
It would be interesting to see whether these activations only show up in larger models, or whether there's some relation to model size.