|
|
|
|
|
by quantadev
594 days ago
|
|
Yes even the original Transformers model had only millions of parameters and nonetheless showed "impressive capabilities" because it also had Self-Attention. If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any. |
|
There aren't many multi-billion-parameters non-transformer models because of path dependence, but that doesn't mean that only transformers can achieve this kind of results.