by quantadev
595 days ago
It's not just about size. Self-attention is every bit as important as scale: if we had today's model sizes but no self-attention, we wouldn't have the emergent intelligence. Also, "size" isn't even a new innovation. Self-attention was.

SSMs (state space models) are literally the proof that all that really matters is training scalability.
The universal approximation theorem doesn't care about the architecture, after all.