
by jmmcd 1211 days ago

Transformers aren't really a wonderful architecture in the sense of a great fit between the architecture and what we know about the task. (For comparison, I think convolutional networks are.)

What makes Transformers great is:

1. They can handle long sequences without a large increase in the number of parameters to be trained.

2. They parallelize better than previous sequence models, i.e. LSTMs. If we could train LSTMs of the same size and on the same amount of training data as current Transformers, they'd probably be just as good.
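
To make point 2 concrete, here is a rough toy sketch in plain Python. It is not a real Transformer or LSTM: the attention uses identity Q/K/V projections, and the recurrence is a simple tanh update standing in for an LSTM cell, with made-up scalar weights `w` and `u`. All function names are hypothetical. The point is only that each attention output depends on the inputs alone, so positions can be computed in any order (or all at once), while each recurrent state needs the previous one.

```python
import math


def attend_position(xs, i):
    """Toy single-head self-attention output for position i.

    Assumption: identity Q/K/V projections, so no learned weights at all.
    xs is a list of equal-length vectors (lists of floats).
    """
    d = len(xs[0])
    # Scaled dot-product scores of position i against every position.
    scores = [sum(a * b for a, b in zip(xs[i], x)) / math.sqrt(d) for x in xs]
    # Numerically stable softmax over the scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # Weighted average of the input vectors.
    return [sum(w * x[k] for w, x in zip(weights, xs)) / total for k in range(d)]


def attention_all(xs, order):
    """Compute every position's output, visiting positions in `order`.

    No output depends on another output, so the order is irrelevant
    and the work could be done in parallel.
    """
    out = [None] * len(xs)
    for i in order:
        out[i] = attend_position(xs, i)
    return out


def recurrent_states(xs, w=0.5, u=0.5):
    """Toy tanh recurrence (a stand-in for an LSTM, not an actual LSTM).

    State t needs state t-1, so the loop over time cannot be reordered.
    """
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [math.tanh(w * hi + u * xi) for hi, xi in zip(h, x)]
        states.append(h)
    return states
```

For example, `attention_all(xs, range(len(xs)))` and `attention_all(xs, reversed(range(len(xs))))` give the same result, whereas `recurrent_states` has to walk the sequence front to back. Neither function's weights grow with `len(xs)`, which is the spirit of point 1.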

1 comment

naasking 1210 days ago

So maybe RWKV [1] is the next step. It parallelizes even better and seems to have no sequence-length limit.

[1] https://github.com/BlinkDL/RWKV-LM
