by quantadev 601 days ago
In the quote you're calling wrong (41986010), you're interpreting "scaling up" as "scaling up, including changing architecture". Scaling up transformers just means scaling up transformers while keeping everything else the same. In other words, you're interpreting "parameter size" as "parameter size, independent of architecture", and I meant the parameter size of a Transformer (in the context of with vs. without Self-Attention).

Pathetic.
Straw-manning failed, so now you insult.
There's no straw-man, and you are now at the point of trying to re-invent the definition of words in order to somehow “win the argument” without even respecting your own previous position. This behavior is legit pathetic; it's not an insult, it's a fact. Respect yourself.
I stand by every word: 1) Self-Attention is more important than scale, and 2) to test that claim, simply remove SA from a transformer and see if it destroys the "intelligence" or not. There's nothing confusing about that, but thanks for your concerns and your polite words.
No, that wasn't your argument, and this new one is of course a much weaker one that you fell back onto to be “technically right”.

That attention heads are mandatory for transformers is a tautology (without them a transformer is just an MLP…), so of course this statement is going to be correct, by definition.

But when you move the goalposts to land on a tautology, you've surrendered your ability to argue anything and you are just ridiculing yourself. Take this question of yours, for instance:

> If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.

Which is a legit, non-ridiculous one.

If you replace it with your later much weaker argument:

> If you know of any MLPs that have had success (even at the GPT-2 level), I'd be interested to know what they are, because I don't know of any.

Then it becomes a dumb question, given that MLPs have no way of encoding context and can't process sequences of words in the first place.

So when you argue that it was your argument all along, it's particularly embarrassing because you're just arguing that your previous arguments were equally dumb even when they weren't.

That's why I said you're disrespecting your earlier argumentation by retreating to your later tautology.

Your ad hominem ratcheted up again. lol. It's ok. No prob. Learn what a tautology is tho bro. It's perfectly legit to discuss how a Transformer would perform if only the Self-Attention part was removed (and everything else kept constant), as an experiment, to refute someone's bizarre claim that the SA part isn't doing the real magic in them. As for the actual other networks you've mentioned, they fail to beat Transformers, and will continue to fail, until something analogous to SA is built into them, because language comprehension simply cannot be done without sensitivity to word context, especially over "long ranges" in the input sequences.
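
For anyone who wants to poke at the "remove SA and keep everything else constant" experiment concretely, here is a minimal sketch in plain Python. It is a toy, not anyone's actual model: the weights are random, there is a single unmasked attention head, and there is no layer norm or positional encoding. All the names (`block`, `make_params`, `last_position_sees_first_token`) are made up for illustration. It only checks one thing: whether the output at the last position can react to a change in the first token.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp(x, W1, W2):
    """Position-wise feed-forward layer with ReLU; sees one token at a time."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def self_attention(xs, Wq, Wk, Wv):
    """Single-head, unmasked scaled dot-product attention over the sequence."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    d = len(qs[0])
    out = []
    for q in qs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vs)) for j in range(d)])
    return out


def block(xs, params, use_attention):
    """Toy transformer block: optional attention, then MLP, both with residuals."""
    if use_attention:
        attn = self_attention(xs, params["Wq"], params["Wk"], params["Wv"])
        xs = [[a + b for a, b in zip(x, o)] for x, o in zip(xs, attn)]
    return [
        [a + b for a, b in zip(x, mlp(x, params["W1"], params["W2"]))]
        for x in xs
    ]


def make_params(d, seed=0):
    rng = random.Random(seed)
    return {
        "Wq": rand_matrix(d, d, rng),
        "Wk": rand_matrix(d, d, rng),
        "Wv": rand_matrix(d, d, rng),
        "W1": rand_matrix(4 * d, d, rng),
        "W2": rand_matrix(d, 4 * d, rng),
    }


def last_position_sees_first_token(use_attention, d=4, seq_len=5, seed=0):
    """Edit only the first token and report whether the last output moved."""
    params = make_params(d, seed)
    rng = random.Random(seed + 1)
    xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(seq_len)]
    edited = [row[:] for row in xs]
    edited[0] = [v + 1.0 for v in edited[0]]
    before = block(xs, params, use_attention)[-1]
    after = block(edited, params, use_attention)[-1]
    return any(abs(a - b) > 1e-9 for a, b in zip(before, after))


if __name__ == "__main__":
    print("with attention:   ", last_position_sees_first_token(True))
    print("without attention:", last_position_sees_first_token(False))
```

With the attention step switched off, each position only ever passes through the same MLP on its own, so the last output cannot depend on the first token. That says nothing about how a trained model would score on a benchmark; it just illustrates the "no sensitivity to word context" point in the smallest possible setting.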