Hacker News
by jmmcd 1211 days ago
Transformers aren't really a wonderful architecture in the sense of a great fit between the architecture and what we know about the task. (For comparison, I think convolutional networks are.)

What makes Transformers great is:

1. They can handle long sequences without a large increase in the number of parameters to be trained.

2. They parallelize better than previous sequence models, i.e. LSTMs. If we could train LSTMs of the same size and on the same amount of training data as current Transformers, they'd probably be just as good.
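
To make the two points concrete, here is a rough pure-Python sketch, not anyone's actual implementation. It assumes a single unmasked attention head with no positional encoding, and uses a plain tanh RNN as a simpler stand-in for an LSTM. All function names (`init_attention`, `attend_one`, `self_attention`, `simple_rnn`, `param_count`) are made up for illustration.

```python
import math
import random


def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_matrix(rows, cols, rng):
    # Small random weights, scaled by 1/sqrt(cols).
    return [[rng.gauss(0, 1) / math.sqrt(cols) for _ in range(cols)]
            for _ in range(rows)]


def init_attention(d_model, rng):
    # Point 1: the weight shapes depend only on d_model,
    # never on how long the input sequence is.
    return {name: random_matrix(d_model, d_model, rng)
            for name in ("W_q", "W_k", "W_v")}


def param_count(params):
    return sum(len(row) for W in params.values() for row in W)


def attend_one(params, xs, i):
    # Output for position i, computed from the inputs alone.
    # It never needs the output of any other position.
    q = matvec(params["W_q"], xs[i])
    keys = [matvec(params["W_k"], x) for x in xs]
    values = [matvec(params["W_v"], x) for x in xs]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(q))]


def self_attention(params, xs):
    # Point 2: positions are independent of each other's outputs, so this
    # loop could run in any order, or all at once on parallel hardware.
    return [attend_one(params, xs, i) for i in range(len(xs))]


def simple_rnn(W_h, W_x, xs):
    # Step t needs the hidden state from step t-1, so this loop
    # is inherently sequential along the sequence.
    h = [0.0] * len(W_h)
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_h, h), matvec(W_x, x))]
        h = [math.tanh(p) for p in pre]
        outputs.append(h)
    return outputs
```

The same `params` can be applied to a 3-token or a 50-token input without changing `param_count(params)`. The recurrent weights are also a fixed size, so what the sketch really contrasts is the dependency structure: `attend_one` can be called for any position in isolation, while `simple_rnn` has to walk the sequence in order.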

1 comment

So maybe RWKV [1] is the next step. It parallelizes even better and seems to have no sequence-length limit.

[1] https://github.com/BlinkDL/RWKV-LM