by littlestymaar
595 days ago
This doesn't match the common knowledge on the topic, which is that model size matters more than architecture. And the amount of training data matters even more, which is why single-digit-billion-parameter models are stronger than hundreds-of-billions ones from several years earlier, when “Chinchilla optimal training” was in fashion. SSMs are literally the proof that all that really matters is training scalability. The universal approximation theorem doesn't care about the architecture, after all.

Take our "current large size" LLMs (my words from the last post), as they are today, simply remove the self-attention wiring, and see whether that destroys the emergent intelligence aspect. I claim it would. But at the same time, this doesn't mean you can just stick self-attention onto a small model and expect intelligence to emerge again.