by quantadev 595 days ago
When discussing "Attention Heads" in the context of the Transformers Paper, there's no need to put the word "Self" in front of it, as in "Self-Attention". That's the context in which I used the word Attention above. Something similar to self-attention had pre-existed this paper, but not actual self-attention.

You're right that getting rid of "Recurrence" was another innovation, but removing it was probably more of a hack to make things parallelizable than something that was architecturally justifiable from first principles (like self-attention is). There's definite "power" in Recurrence (making it desirable), but it's just too costly to run in LLMs because of the compute it requires.


> removing it was probably more of a hack to make things parallelizable

But that's the entire point of it. Transformer-based LLMs are “more intelligent” just because this parallelization lets you make them bigger and train them on bigger datasets.
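To make the parallelization point concrete, here is a minimal pure-Python sketch, not anything from the paper itself. It assumes a toy self-attention with no learned weights (queries, keys and values are all just the inputs) and no causal mask, plus a hypothetical `recurrence` function with a made-up `decay` parameter. What it shows is only the data dependency: each attention output reads the raw inputs, so positions can be computed in any order, while each recurrent state needs the previous one.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(i, xs):
    # Toy self-attention for position i: queries = keys = values = inputs.
    scale = math.sqrt(len(xs[i]))
    weights = softmax([dot(xs[i], x) / scale for x in xs])
    return [sum(w * x[d] for w, x in zip(weights, xs)) for d in range(len(xs[i]))]

def self_attention(xs, order=None):
    # Every position depends only on the inputs, never on another output,
    # so the loop order is irrelevant (and could run in parallel).
    order = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for i in order:
        out[i] = attend(i, xs)
    return out

def recurrence(xs, decay=0.5):
    # Toy recurrent update (an assumption for illustration):
    # h_t needs h_{t-1}, which forces left-to-right evaluation.
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [math.tanh(decay * hp + xv) for hp, xv in zip(h, x)]
        states.append(h)
    return states

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens, order=reversed(range(len(tokens))))
states = recurrence(tokens)
```

Computing the attention outputs back to front gives the same result as front to back, whereas the recurrent loop has no such freedom: there is no way to get the last state without first computing every earlier one.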

It's not just about size. Self-Attention is every bit as important as large size, because if we had the current large size, but without Self-Attention we wouldn't have the emergent intelligence. Also "size" isn't even a new innovation. Self-Attention was a new innovation.

This doesn't match the common knowledge on the topic, which is that model size matters more than the architecture. And training-set size matters even more, which is why single-digit-billion-parameter models are stronger than hundreds-of-billions-parameter ones from several years earlier, when “Chinchilla optimal training” was in fashion.

SSMs are literally the proof that all that really matters is training scalability.

The universal approximation theorem doesn't care about the architecture, after all.

If you parse my words a bit more carefully, you'll realize there's a simple thought experiment (or real experiment) you can do to test my claim:

Take our "current large size" (my words from last post) LLMs, as they are currently today, and then simply remove the Self-Attention wiring, and see if that destroys the emergent intelligence aspect or not. I claim it would. But at the same time this doesn't mean you can just stick Self-Attention onto a small model and expect intelligence to once again emerge.
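For anyone who wants to see what "removing the Self-Attention wiring" does mechanically, here is a toy sketch under heavy assumptions: a single block, no learned weights, a hypothetical `feed_forward` standing in for the MLP sublayer, and a made-up `use_attention` switch. It says nothing about emergent intelligence either way. It only shows that attention is the sublayer through which one token's content reaches another position.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(xs):
    # Toy self-attention: queries = keys = values = inputs, no learned weights.
    scale = math.sqrt(len(xs[0]))
    out = []
    for q in xs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in xs])
        out.append([sum(w * v[d] for w, v in zip(weights, xs)) for d in range(len(q))])
    return out

def feed_forward(x):
    # Hypothetical per-token nonlinearity standing in for the MLP sublayer.
    return [math.tanh(2.0 * v) for v in x]

def block(xs, use_attention=True):
    if use_attention:
        mixed = attention(xs)
        # Residual connection around the attention sublayer.
        xs = [[a + b for a, b in zip(x, m)] for x, m in zip(xs, mixed)]
    # Residual connection around the per-token sublayer.
    return [[a + b for a, b in zip(x, feed_forward(x))] for x in xs]

original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edited = [[-3.0, 2.0]] + original[1:]  # change only the first token

# Does the output at the last position notice the edit to the first token?
ablated_unchanged = block(original, False)[2] == block(edited, False)[2]
full_changed = block(original, True)[2] != block(edited, True)[2]
```

With the attention sublayer switched off, the last position's output is identical no matter what the first token is, because the remaining sublayer processes each token in isolation. With it switched on, the edit propagates.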

You are wildly overestimating the “emergent capabilities” of current models, and underestimating the performance of alternative architectures (namely SSMs) at the same size.

Also, the performance of modern “small” models shows that your last sentence isn't really true either.

> wildly overestimating the “emergent capabilities”

How could I be "overestimating" the emergent capabilities when I never even quantified those capabilities other than to call them "emergent" and impressive?

> “small” models show that your last sentence isn't true either.

I never said that even a perfect architecture would make small models "intelligent". However, to the extent that even smaller LLMs can exhibit surprising capabilities, that's more evidence IN FAVOR OF everything I've said, not evidence against.

EDIT: But in that last sentence (of my prior reply), what I meant by "small" was genuinely small, meaning non-LLM, and you seem to have interpreted it as "a smaller LLM".