|
|
|
|
|
by cztomsik
395 days ago
|
|
I have no idea how it works actually (in google) but I wouldn't be surprised if it was just post-training because recently RWKV people did something similar: They replaced the whole attention mechanism with WKV (forward-only linear attention), and created such franken-stein just by post-training. The big wow moment about that is that it sort of implies that most of the useful knowledge is in the FFN, and attention itself is not that unique/important. https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l... BTW: It could be also interesting to try use already trained attention and see how long the FFN itself takes in the gpt2 speedtraining (it would be against the rules but still very interesting IMHO - definitely something I'd like to read paper about)
https://github.com/KellerJordan/modded-nanogpt Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained, and if both of these statements are true maybe we could just train everything much faster just by sharing fixed embeddings and attentions. |
|
In a way everyone knew this would work. Nobody did it because it's so inefficient even R and Python users thought that it would be ridiculously slow (or simply couldn't execute it enough to train to a reasonable extent)