
by cztomsik 395 days ago

I have no idea how it actually works (at Google), but I wouldn't be surprised if it was just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created such a Frankenstein model just by post-training.

The big wow moment is that it sort of implies that most of the useful knowledge is in the FFN, and that attention itself is not that unique or important.

https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...

BTW: It could also be interesting to try using already-trained attention and see how long the FFN alone takes to train in the GPT-2 speed training (it would be against the rules but still very interesting IMHO, and definitely something I'd like to read a paper about): https://github.com/KellerJordan/modded-nanogpt

Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attention layers.

5 comments

spwa4 395 days ago

Ever notice that attention is (with the highest respect to the original researchers) "just" inputting the entire past of the network into a reverse-MoE neural network? (meaning the expert is selecting parts of the input instead of parts of the neural network to execute)
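To make the "selecting parts of the input" reading concrete, here is a minimal sketch of causal softmax attention in plain Python. It is a toy, not how any production model implements it: the function names (`softmax`, `causal_attention`) and the tiny two-dimensional tokens are assumptions made up for illustration, and queries, keys, and values are all taken to be the raw tokens rather than learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """For each position t, mix values[0..t] using softmax(q_t . k_i / sqrt(dim))."""
    dim = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Score the current query against every past (and current) key.
        scores = [
            sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(dim)
            for i in range(t + 1)
        ]
        weights = softmax(scores)  # a soft "selection" over the past inputs
        mixed = [
            sum(w * values[i][d] for i, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy sequence: token 2 repeats token 0.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = causal_attention(tokens, tokens, tokens)
```

In this toy run, the weights for the last position favor positions 0 and 2 over position 1, which is the "routing over inputs" behavior described above. The nested loop over positions is also where the quadratic cost comes from.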

In a way, everyone knew this would work. Nobody did it because it's so inefficient that even R and Python users thought it would be ridiculously slow (or simply couldn't run it long enough to train to a reasonable extent).


scotty79 395 days ago

Attention is just a completely arbitrary way to split the network so that learning can be parallelized.

What contributed more towards success in my opinion are "shortcut connections" through layers which enable more influence on early layers during learning.


grumbelbart2 395 days ago

> What contributed more towards success in my opinion are "shortcut connections" through layers which enable more influence on early layers during learning.

For those who don't know, that is the idea behind ResNet (He et al., Deep Residual Learning for Image Recognition, https://arxiv.org/abs/1512.03385), one of the most influential papers in deep learning of all time.

Residual connections make it possible to train networks that are arbitrarily deep. Before ResNet, networks that were too deep were essentially not trainable due to vanishing or exploding gradients.
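A toy illustration of the vanishing-gradient point, not the ResNet setup itself: the sketch below pushes a scalar through a chain of tanh layers and tracks the derivative of the output with respect to the input, with and without a skip connection. The function name `input_gradient`, the shared weight of 0.5, and the depth of 50 are all assumptions picked for illustration.

```python
import math

def input_gradient(x, weight, depth, residual):
    """Return d(output)/d(input) for a chain of scalar tanh layers.

    Plain layer:    h -> tanh(weight * h)
    Residual layer: h -> h + tanh(weight * h)
    """
    h, grad = x, 1.0
    for _ in range(depth):
        t = math.tanh(weight * h)
        local = weight * (1.0 - t * t)  # derivative of tanh(weight * h) w.r.t. h
        if residual:
            h = h + t
            grad *= 1.0 + local  # the "+ 1" is the shortcut path
        else:
            h = t
            grad *= local
    return grad

plain = input_gradient(1.0, 0.5, 50, residual=False)
skip = input_gradient(1.0, 0.5, 50, residual=True)
print(f"plain: {plain:.3e}  residual: {skip:.3e}")
```

Each plain layer multiplies the gradient by at most 0.5, so after 50 layers it is bounded by 0.5**50. Each residual layer multiplies it by at least 1, so the signal reaching the early layers cannot shrink to zero in this toy.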


scotty79 392 days ago

It's really nice to have your personal intuitions in a field you barely know confirmed by research.


cubefox 395 days ago

> Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained

That was from here: https://news.ycombinator.com/item?id=44054425


jonahx 394 days ago

So is the famous "Attention is all you need" wrong?


slickytail 395 days ago

The relative unimportance of the exact SDPA attention in use in modern transformers is already known: https://arxiv.org/abs/2111.11418

The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc.
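As a rough sketch of that claim, the block below keeps normalization, residual connections, and a per-token FFN fixed, and takes the token mixer as a plug-in argument. Everything here is a hypothetical toy in plain Python (the names `block`, `mean_pool_mixer`, `neighbor_mixer`, and `toy_ffn`, and the fixed 0.5 scaling in the FFN are assumptions); it only shows the structure, not the linked paper's actual architecture or results.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def mean_pool_mixer(tokens):
    """Pooling: every token receives the average of all tokens."""
    count = len(tokens)
    avg = [sum(col) / count for col in zip(*tokens)]
    return [list(avg) for _ in tokens]

def neighbor_mixer(tokens):
    """Convolution-like: each token averages itself with its left neighbor."""
    out = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else tok
        out.append([(a + b) / 2 for a, b in zip(tok, prev)])
    return out

def toy_ffn(vec):
    """Per-token feed-forward stand-in: ReLU of a fixed, made-up scaling."""
    return [max(0.0, 0.5 * v) for v in vec]

def block(tokens, mixer):
    """norm -> token mixer -> residual add, then norm -> FFN -> residual add."""
    normed = [layer_norm(t) for t in tokens]
    tokens = [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixer(normed))]
    normed = [layer_norm(t) for t in tokens]
    return [[a + b for a, b in zip(t, toy_ffn(nv))] for t, nv in zip(tokens, normed)]

x = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 2.0, -2.0]]
y_pool = block(x, mean_pool_mixer)
y_neighbor = block(x, neighbor_mixer)
```

The only part that differs between `y_pool` and `y_neighbor` is how information moves between tokens; passing an attention function with the same list-in, list-out shape would slot in the same way.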


cztomsik 394 days ago

Hm, the residual is the one I would not expect. Can you elaborate on why?


simsla 394 days ago

Avoids vanishing gradients in deeper networks.

Also, most blocks with a residual approximate the identity function when initialised, so tend to be well behaved.
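A small sketch of the identity-at-init point, under stated assumptions: a residual block `x + W_out * relu(W_in * x)` whose output projection is initialized at or near zero. The helper names (`make_block`, `residual_block`, `run_stack`), the sizes, and the 0.02 init scale are made up for illustration; real models differ in how they initialize.

```python
import random

def make_block(dim, hidden, out_scale, seed):
    """Random input projection; output projection scaled by out_scale (0.0 = zero init)."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[out_scale * rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(dim)]
    return w_in, w_out

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def residual_block(x, params):
    """y = x + W_out * relu(W_in * x)"""
    w_in, w_out = params
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    update = matvec(w_out, hidden)
    return [a + b for a, b in zip(x, update)]

def run_stack(x, depth, out_scale):
    """Apply `depth` freshly initialized residual blocks in sequence."""
    for layer in range(depth):
        x = residual_block(x, make_block(len(x), 8, out_scale, seed=layer))
    return x

x = [1.0, -2.0, 0.5, 3.0]
exact = run_stack(x, depth=100, out_scale=0.0)   # zero-init output projection
close = run_stack(x, depth=1, out_scale=0.02)    # small-init output projection
```

With `out_scale=0.0` the whole 100-block stack returns its input unchanged, and with a small scale a single block only nudges it slightly, so a freshly initialized deep network starts out close to a well-behaved identity map.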
