Hacker News
by quantadev 594 days ago
If you parse my words a bit more carefully, you'll realize there's a simple thought experiment (or real experiment) you can do to test my claim:

Take our "current large size" (my words from last post) LLMs as they are today, simply remove the Self-Attention wiring, and see whether that destroys the emergent intelligence aspect. I claim it would. At the same time, this doesn't mean you can just stick Self-Attention onto a small model and expect intelligence to emerge again.
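
To make the ablation concrete, here is a toy sketch in plain Python. It is not a real LLM and not anyone's actual experiment: the `make_block` and `block` helpers, the random untrained weights, and the 4-dimensional tokens are all hypothetical, chosen only for illustration, and layer norm is omitted for brevity. It shows just the structural point: in a standard Transformer block, Self-Attention is the only step where one position reads from another, so with it switched off the last token's output cannot depend on earlier tokens. It says nothing about whether emergent capabilities would survive at scale.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_block(dim, seed=0):
    """Random, untrained weights: a hypothetical stand-in for a trained block."""
    rng = random.Random(seed)

    def rand():
        return [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]

    return {name: rand() for name in ("Wq", "Wk", "Wv", "W1", "W2")}


def block(tokens, params, use_attention=True):
    dim = len(tokens[0])
    hidden = [list(t) for t in tokens]
    if use_attention:
        q = [matvec(params["Wq"], t) for t in tokens]
        k = [matvec(params["Wk"], t) for t in tokens]
        v = [matvec(params["Wv"], t) for t in tokens]
        mixed = []
        for i in range(len(tokens)):
            # Causal attention: position i reads from positions 0..i.
            scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                      for j in range(i + 1)]
            weights = softmax(scores)
            context = [sum(w * v[j][d] for j, w in enumerate(weights))
                       for d in range(dim)]
            mixed.append([h + c for h, c in zip(hidden[i], context)])  # residual
        hidden = mixed
    out = []
    for h in hidden:
        # Position-wise feed-forward: each token is processed on its own.
        ff = matvec(params["W2"], [max(0.0, a) for a in matvec(params["W1"], h)])
        out.append([a + b for a, b in zip(h, ff)])  # residual
    return out


params = make_block(4)
seq_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
seq_b = [[0.0, 0.0, 0.0, 1.0]] + seq_a[1:]  # only the first token differs

same_without = block(seq_a, params, False)[-1] == block(seq_b, params, False)[-1]
same_with = block(seq_a, params, True)[-1] == block(seq_b, params, True)[-1]
print("last token unchanged without attention:", same_without)
print("last token unchanged with attention:", same_with)
```

Note that this only speaks to removing attention without putting anything in its place. Architectures that mix tokens by other means are a separate question.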


You are wildly overestimating the “emergent capabilities” of current models, and underestimating the performance of alternative architectures (namely SSMs) at the same size.

Also, the performance of modern “small” models shows that your last sentence isn't really true either.

> wildly overestimating the “emergent capabilities”

How could I be "overestimating" the emergent capabilities when I never even quantified those capabilities other than to call them "emergent" and impressive?

> “small” models show that your last sentence isn't true either.

I never said that even a perfect architecture would make small models "intelligent". However, to the extent that even smaller LLMs can exhibit surprising capabilities, that's more evidence IN FAVOR OF everything I've said, not evidence against.

EDIT: In that last sentence (of my prior reply), by "small" I meant genuinely small, meaning non-LLM, and you seem to have interpreted it as "a smaller LLM".

Even 1B-parameter models show “impressive capabilities” to anyone not accustomed to the current state of the art. And there are plenty of relatively small models that perform as well as ChatGPT 3.5 did when it was first released and felt like magic.

“All” that was needed to get there was “just” feeding it more data. The fact that we were actually able to train billion-parameter models on multiple trillions of tokens is the key property of transformers. There's no magic beyond that (it's already cool enough though): it's not so much that they are more intelligent, it's simply that with them we can brute-force in a scalable fashion.

Yes, even the original Transformer model had only millions of parameters and nonetheless showed "impressive capabilities", because it also had Self-Attention.

If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.

RWKV.

There aren't many multi-billion-parameter non-transformer models because of path dependence, but that doesn't mean that only transformers can achieve these kinds of results.
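
For illustration only: attention is not the only way for one position to see another. The sketch below is NOT RWKV's actual formulation. It is a generic decayed running sum (the `decayed_state_mixer` name and the `decay` value are made up for this example), and it shows only that a recurrent state also carries information from earlier tokens forward. It makes no claim about how well such models perform.

```python
def decayed_state_mixer(tokens, decay=0.5):
    """Mix tokens through a running state instead of attention.

    Each step shrinks the old state by `decay` and adds the new token,
    so every output depends on all earlier tokens, with older ones
    contributing less.
    """
    state = [0.0] * len(tokens[0])
    outputs = []
    for tok in tokens:
        state = [decay * s + x for s, x in zip(state, tok)]
        outputs.append(list(state))
    return outputs


seq_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
seq_b = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # only the first token differs

out_a = decayed_state_mixer(seq_a)
out_b = decayed_state_mixer(seq_b)
print(out_a[-1])  # [0.25, 0.5]
print(out_b[-1])  # [0.0, 0.5]
```

The last position differs between the two runs even though only the first token changed, which is the cross-position dependence that a purely position-wise network lacks.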

My statements (which you disagreed with, without exception) haven't been about Transformers vs. non-Transformers. Everything above has been about the importance of the Self-Attention part of it. We could remove Self-Attention from Transformers and still have a functional (but dumb) NN, and that was my point.

Your position was that Self-Attention is a less important part (because UAT, yadda yadda), and my position was that it's the key ingredient. Every statement I made above that you called wrong was correct. lol.