Hacker News new | ask | show | jobs
by p1esk 843 days ago
Scale is all we need for transformers (so far). It might also be all we need for ape brains. It’s not all we need for whatever elephant or dolphin brains evolved from.

When this stops being the case for transformers, we will need something else. I’m just pointing out it’s not the case yet.

1 comments

I see no evidence of this in biology nor in ML. I've read those scale papers. I've worked on scale myself. I'll bet the farm that scale isn't all you need. But I won't be surprised if people say that it is all scale.

If you really think it is all scale, train a 7T ResNet MLP based model for NLP. If scale is all you need, make a LLM without DPO or RLHF. If scale is all you need, make SD3 with a GAN. Or what about a VAE, Normalizing Flow, HMM? Do it with different optimizers. Do it with gradient free methods. Do it with different loss functions.

The bitter lesson wasn't "scale is all you need." That's just a misinterpretation.

Edit: It's fine to disagree. We can compete on the ideas and methods. That's good for our community. So continue down yours, and I'll continue down mine. All I ask is that since your camp is more popular, you don't block those on my side. If you're right, we'll get to AGI soon. If we're right, we still might. But if we're right and you all block us, we'll get another AI winter in between. If we're right and you all don't block us, we can progress without skipping a beat. Just don't put all your eggs in one basket. It isn't good for anyone.

I said “scale is all you need for transformers”. That has been true since GPT1. The best way to improve our best model today still seems to be “make it larger and train it on more data”.

If you disagree please suggest a better way, or at least provide evidence that scaling up no longer works for transformers.

> at least provide evidence that scaling up no longer works for transformers.

Isnt the Mixture-of-Experts trend (GPT4 is MoE?) kinda of a proof ?

Of scale? I would think not. I would say they are evidence against scale because they are more an argument for multi agent systems. Scale is about a singular framework. What that means is debatable though (I mean anything we call a singular network can be decomposed into sub networks. It's messy), hence the other part of my comment about not scale solutions being claimed as scale.

> I said “scale is all you need for transformers”

No you didn't. What kicked this all off was

> What’s the main difference between an ape’s brain and a human brain? Scale.

Don't retcon.

I think they went MoE purely because straight up scaling from 175B to 1.8T is just too expensive. But it’s still 10x scaling, right?