by kouteiheika
159 days ago
If you want to prove a new alternative to attention (i.e. show that it works and/or is faster in a real-world scenario) without breaking the bank, one of the best ways is probably to retrain an already existing model with only the attention modules swapped out. Once you have such a model, you can run apples-to-apples benchmarks against the original. This has been done successfully in the past (https://huggingface.co/featherless-ai/QRWKV-72B). Note that this is a 72B model, which would be very expensive to train from scratch, but here they did the conversion for less than $2000.
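To make the "swap the attention modules" idea concrete, here is a rough PyTorch sketch on a toy model. Everything in it is made up for illustration: `SelfAttention`, `Block`, `LinearMixer` and `swap_attention` are hypothetical names, `LinearMixer` is just a causal running mean and not RWKV, and this is not the procedure used for the QRWKV conversion. It only shows the mechanics: replace the attention submodules in place, freeze the pretrained weights, and leave only the new modules trainable.

```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Toy attention layer standing in for the pretrained model's attention."""

    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class LinearMixer(nn.Module):
    """Hypothetical replacement mixer: causal running mean plus a projection."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x has shape (batch, seq, dim); average over all positions <= t.
        csum = torch.cumsum(x, dim=1)
        counts = torch.arange(1, x.shape[1] + 1, device=x.device, dtype=x.dtype)
        return self.proj(csum / counts.view(1, -1, 1))


class Block(nn.Module):
    """Pre-norm transformer block with a swappable `mixer` submodule."""

    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = SelfAttention(dim, heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))


def swap_attention(model, make_replacement):
    """Freeze all existing weights, then replace every SelfAttention module.

    Returns the number of modules swapped. The replacements are created after
    the freeze, so their parameters keep the default requires_grad=True.
    """
    for p in model.parameters():
        p.requires_grad = False
    swapped = 0
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, SelfAttention):
                setattr(parent, name, make_replacement())
                swapped += 1
    return swapped


dim, heads = 16, 2
model = nn.Sequential(Block(dim, heads), Block(dim, heads))  # "pretrained" stand-in
n_swapped = swap_attention(model, lambda: LinearMixer(dim))
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
out = model(torch.randn(2, 5, dim))
```

After the swap you would train only the parameters listed in `trainable` (for example by matching the original model's outputs) and then benchmark the converted model against the original on the same evals. In a real conversion you would likely unfreeze more of the model later on; freezing everything else here is just the simplest starting point.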
https://github.com/KellerJordan/modded-nanogpt