by quantadev
594 days ago
My statements (which you disagreed with, without exception) haven't been about Transformers vs. non-Transformers. Everything above has been about the importance of the Self-Attention part of it. We could remove Self-Attention from Transformers and still have a functional (but dumb) NN, and that was my point. Your position was that the Self-Attention is a less important part (because UAT, yadda yadda), and my position was that it's the key ingredient. Every statement above that I made, that you called wrong, was correct. lol.
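To make the architectural point concrete, here is a rough sketch (not anyone's actual model) of a Transformer-style block in plain Python where self-attention can be switched off. Everything in it is an assumption for illustration: the `block`, `self_attention`, and `feed_forward` names, the identity Q/K/V projections, the fixed toy MLP weights, and the missing layer norm. It only shows that, in this sketch, attention is the one step that moves information between positions; without it each token is processed in isolation.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Single-head scaled dot-product attention.
    # Assumption: Q, K and V are the tokens themselves (identity projections).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def feed_forward(token):
    # Toy position-wise MLP: a fixed affine map followed by ReLU.
    # The constants are hypothetical, not learned weights.
    return [max(0.0, 2.0 * x - 0.5) for x in token]


def block(tokens, use_attention=True):
    # Residual attention step (optional), then residual per-token MLP.
    if use_attention:
        mixed = self_attention(tokens)
        tokens = [[t + m for t, m in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]
    return [[t + f for t, f in zip(tok, feed_forward(tok))] for tok in tokens]
```

With `use_attention=False`, changing the second token cannot change the output at the first position; with `use_attention=True` it can. Whether that mixing has to be attention specifically, rather than some other token-mixing mechanism, is exactly what this thread is arguing about, and the sketch does not settle it.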
You claimed that self-attention was needed to reach the level of intelligence we've seen with GPT 3.5:
> without those attention heads even the scaling up to current parameter sizes we have to day would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. (Verbatim quote from you https://news.ycombinator.com/item?id=41986010)
This is the claim I've been disputing: my response is that the key feature behind the intelligence of transformer models is their scalability. And now that we have alternatives that scale equally well (SSMs and RWKV), unsurprisingly we see them achieve the same level of reasoning ability.
> Every statement above that I made, that you called wrong, was correct. lol.
Well, except the one quoted above at least…