HN Mirror


by impossiblefork 660 days ago

I've tried that in a small transformer that I trained from scratch, and it didn't really make any difference. I also made a version where the extra term was trainable, probably by replacing the 1 with a learned constant associated with each layer, and that didn't make any difference either.

I didn't follow Miller's proposal quite as he wrote it, though: I put the mechanism in all the layers rather than leaving it out at the end.

My test doesn't absolutely rule out usefulness, since there are always different ways of applying something, but I saw no indication of it.
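For readers who want to see the mechanism concretely, here is a minimal sketch of a softmax with an extra constant in the denominator. With `c = 1` it is the "plus one" form that Miller's proposal describes, and `c = 0` recovers the standard softmax. The function name `softmax_plus_c` and the parameter `c` are made up for illustration; the comment above does not say how the trainable version was actually implemented, so treating `c` as a per-layer learned scalar is an assumption.

```python
import math

def softmax_plus_c(scores, c=1.0):
    """Compute exp(x_i) / (c + sum_j exp(x_j)) for a list of attention scores.

    c = 0.0 is the standard softmax; c = 1.0 is the "plus one" variant.
    A trainable version would (as an assumption) learn one c per layer.
    """
    # Numerical stability: shift all logits by m before exponentiating.
    # When c > 0 the constant acts like an extra logit fixed at 0, so m is
    # clamped to be at least 0 to keep exp(-m) from overflowing.
    m = max(scores) if c == 0 else max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    denom = c * math.exp(-m) + sum(exps)
    return [e / denom for e in exps]

# With c = 1 the weights no longer have to sum to 1, so a head can
# attend to (almost) nothing when every score is strongly negative.
weights = softmax_plus_c([-50.0, -50.0], c=1.0)
```

The practical difference is in the last two lines: under the standard softmax those two scores would each get weight 0.5, while with the extra constant both weights are close to zero.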


Grosvenor 660 days ago

I guess the next step is to see if you're getting those mega activations as he describes.

A/B test the two models and compare?

It would be interesting to see whether these activations only show up in larger models, or whether they are related to model size in some other way.
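A rough sketch of what such an A/B comparison could measure, assuming you have already dumped a flat list of hidden-state activations from each model (how you collect them depends on your framework and is not shown). The helper names `outlier_ratio` and `kurtosis` are hypothetical, and these two statistics are just one plausible way to quantify outliers, not necessarily the metric Miller uses.

```python
import statistics

def outlier_ratio(activations):
    """Largest absolute activation divided by the median absolute activation."""
    mags = [abs(a) for a in activations]
    med = statistics.median(mags)
    return max(mags) / med if med > 0 else float("inf")

def kurtosis(activations):
    """Plain (non-excess) kurtosis: about 3 for Gaussian data, larger for heavy tails."""
    n = len(activations)
    mean = statistics.fmean(activations)
    var = sum((a - mean) ** 2 for a in activations) / n
    if var == 0:
        return float("nan")  # undefined for constant input
    m4 = sum((a - mean) ** 4 for a in activations) / n
    return m4 / (var ** 2)

# Toy stand-ins for activations from two models; real values would come
# from running the same inputs through both and recording one layer.
baseline_acts = [1.0, -1.0, 1.0, -100.0]
variant_acts = [1.0, -1.0, 1.0, -1.0]
report = {
    "baseline": (outlier_ratio(baseline_acts), kurtosis(baseline_acts)),
    "variant": (outlier_ratio(variant_acts), kurtosis(variant_acts)),
}
```

If the baseline shows a much larger ratio or kurtosis than the variant on the same inputs, that would be a sign the mega activations are present and that the change affects them.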


Grosvenor 659 days ago

https://news.ycombinator.com/item?id=36871528

Hah. Yes. It looks like they only show up in models with 6.7B parameters or more.

The problem can start at 125M parameters, though, which is small enough to test on a whim.

So train a model that exhibits these behaviours, then try it out.
