by p1esk
843 days ago
I said “scale is all you need for transformers”. That has been true since GPT-1. The best way to improve our best model today still seems to be “make it larger and train it on more data”. If you disagree, please suggest a better way, or at least provide evidence that scaling up no longer works for transformers.
Isn't the Mixture-of-Experts trend (GPT-4 is MoE?) kind of a proof?