by littlestymaar
594 days ago
Even 1B-parameter models show “impressive capabilities” to anyone not accustomed to the current state of the art. And there are plenty of relatively small models that perform as well as ChatGPT 3.5 did when it was first released and felt like magic. “All” that was needed to get there was “just” feeding them more data. The fact that we were actually able to train billion-parameter models on multiple trillions of tokens is the key property of transformers. There's no magic beyond that (it's already cool enough though): it's not so much that they are more intelligent, it's simply that with them we can brute-force in a scalable fashion.
If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.