by whimsicalism
1069 days ago
Amazing - 6.7 billion parameters is significantly larger than any transformer alternative I've seen trained so far (H3 only went up to 2.7B; edit: oops, RWKV goes up to 14B). Cool to see that it appears to scale even better, and the O(1) & O(N) scaling is great. I wish the training data sources were more consistent, though. Training on just The Pile would enable a cleaner comparison with the most promising transformer alternatives, like H3, and give a better sense of how robust the cited perplexity improvements are.