by p1esk 843 days ago
I said “scale is all you need for transformers”. That has been true since GPT1. The best way to improve our best model today still seems to be “make it larger and train it on more data”.

If you disagree please suggest a better way, or at least provide evidence that scaling up no longer works for transformers.

1 comments

> at least provide evidence that scaling up no longer works for transformers.

Isn't the Mixture-of-Experts trend (GPT4 is MoE?) kind of a proof?

Of scale? I would think not. I would say they are evidence against scale, because they are more of an argument for multi-agent systems. Scale is about a singular framework. What that means is debatable, though (anything we call a singular network can be decomposed into sub-networks; it's messy), hence the other part of my comment about non-scale solutions being claimed as scale.

> I said “scale is all you need for transformers”

No you didn't. What kicked this all off was

> What’s the main difference between an ape’s brain and a human brain? Scale.

Don't retcon.

I think they went MoE purely because straight up scaling from 175B to 1.8T is just too expensive. But it’s still 10x scaling, right?
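
For what it's worth, here is the arithmetic behind that "10x" as a rough sketch. The 175B and 1.8T figures are the ones quoted above (the 1.8T number is a rumor, not a confirmed spec). The expert count and top-k routing values are purely hypothetical placeholders, chosen only to show why total parameters and per-token compute can diverge in an MoE.

```python
# Parameter counts quoted in the thread (1.8T is an unconfirmed rumor).
dense_params = 175e9
moe_total_params = 1.8e12

scale_factor = moe_total_params / dense_params
print(f"total parameter scaling: {scale_factor:.1f}x")

# Hypothetical MoE layout, for illustration only (not a known GPT-4 config).
num_experts = 16
experts_per_token = 2


def active_params(total, n_experts, top_k, shared_frac=0.0):
    """Parameters actually used per token under top-k routing.

    shared_frac is the (assumed) fraction of parameters outside the experts,
    e.g. attention and embeddings, which every token passes through.
    """
    shared = total * shared_frac
    expert_pool = total - shared
    return shared + expert_pool * top_k / n_experts


active = active_params(moe_total_params, num_experts, experts_per_token)
print(f"active per token: {active / 1e9:.0f}B "
      f"({active / dense_params:.2f}x the dense model)")
```

So whether MoE counts as "10x scaling" depends on what you measure: total parameters grow by roughly 10x, but under these made-up routing numbers the compute per token grows far less.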