by whimsicalism 1069 days ago
Amazing - 6.7B parameters is significantly larger than I've seen any transformer alternative trained to so far (H3 only went up to 2.7B; edit: oops, RWKV goes up to 14B). Cool to see that it appears to be scaling even better, and the O(1) & O(N) scaling is great.

Wish there was more consistency on the source of training data. Training on just The Pile would enable a cleaner comparison with the most promising transformer alternatives, like H3, and give a better sense of how robust the cited perplexity improvements are.


RWKV has a 14B version.
Good point. For some reason I always leave out RWKV when thinking of the transformer models, perhaps because it is more of a redux.