| HN Mirror



	by littlestymaar 594 days ago
	Even 1B-parameter models show “impressive capabilities” to anyone not accustomed to the current state of the art. And there are plenty of relatively small models that perform as well as ChatGPT 3.5 did when it was first released and felt like magic. “All” that was needed to get there was “just” feeding it more data. The fact that we were actually able to train billion-parameter models on multiple trillions of tokens is the key property of transformers; there's no magic beyond that (it's already cool enough though). It's not so much that they are more intelligent, it's simply that with them we can brute-force in a scalable fashion.


quantadev 594 days ago

Yes, even the original Transformer model had only millions of parameters and nonetheless showed "impressive capabilities" because it also had Self-Attention.

If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.


littlestymaar 594 days ago

RWKV.

There aren't many multi-billion-parameter non-transformer models because of path dependence, but that doesn't mean that only transformers can achieve these kinds of results.


quantadev 594 days ago

My statements (which you disagreed with, without exception) haven't been about Transformers vs. non-Transformers. Everything above has been about the importance of the Self-Attention part of it. We could remove Self-Attention from Transformers and still have a functional (but dumb) NN, and that was my point.

Your position was that the Self-Attention is a less important part (because UAT, yadda yadda), and my position was that it's the key ingredient. Every statement above that I made, that you called wrong, was correct. lol.
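
To make the with/without Self-Attention distinction concrete, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: random weights, a single head, no layer norm, no causal mask, and made-up function names (`self_attention`, `feed_forward`, `block`) that don't come from any library. It only shows what the block can and cannot do structurally; it says nothing about trained models.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over x (tokens x dim)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)  # each output row mixes values from all tokens

def feed_forward(x, w1, w2):
    """Position-wise MLP with ReLU: each token is processed independently."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def add(a, b):
    """Element-wise sum of two matrices (the residual connection)."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a, b)]

def block(x, params, use_attention):
    """Residual block; with use_attention=False only the per-token MLP remains."""
    if use_attention:
        x = add(x, self_attention(x, params["wq"], params["wk"], params["wv"]))
    return add(x, feed_forward(x, params["w1"], params["w2"]))

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim = 4
params = {name: random_matrix(rng, dim, dim) for name in ("wq", "wk", "wv", "w1", "w2")}
x = random_matrix(rng, 3, dim)
x_changed = [x[0], x[1], [v + 1.0 for v in x[2]]]  # perturb only the last token

def first_token_moved(use_attention):
    """Does changing the last token change the first token's output?"""
    before = block(x, params, use_attention)[0]
    after = block(x_changed, params, use_attention)[0]
    return any(abs(a - b) > 1e-9 for a, b in zip(before, after))

print("with attention:   ", first_token_moved(True))
print("without attention:", first_token_moved(False))
```

Without the attention step the block is just a per-token MLP, so nothing at one position can influence another. It still runs as a neural network, but any mixing of information across tokens would have to come from some other mechanism.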


littlestymaar 594 days ago

You are moving the goalposts. The discussion has always been about transformers vs. non-transformers.

You claimed that self attention was needed to achieve the level of intelligence that we've seen with GPT 3.5:

> without those attention heads even the scaling up to current parameter sizes we have to day would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. (Verbatim quote from you https://news.ycombinator.com/item?id=41986010)

This is the claim I've been disputing, by responding that the key feature behind the intelligence of transformer models is their scalability. And now that we have alternatives that scale equally well (SSMs and RWKV), unsurprisingly we see them achieve the same level of reasoning ability.

> Every statement above that I made, that you called wrong, was correct. lol.

Well, except the one quoted above at least…


quantadev 593 days ago

In the quote you're calling wrong (41986010), you're interpreting "scaling up" as "scaling up, including changing architecture". Scaling up transformers just means scaling up transformers and keeping everything else the same. In other words, you're interpreting "parameter size" as "parameter size, independent of architecture", whereas I meant the parameter size of a Transformer (in the context of with vs. without Self-Attention).


littlestymaar 593 days ago

Pathetic.
