|
|
|
|
|
by godelski
927 days ago
|
|
> Also, we know that transformers can scale Do we have strong evidence that other models don't scale or have we just put more time into transformers? Convolutional resnets look to scale on vision and language: (cv) https://arxiv.org/abs/2301.00808, (cv) https://arxiv.org/abs/2110.00476, (nlp) https://github.com/HazyResearch/safari MLPs also seem to scale: (cv) https://arxiv.org/abs/2105.01601, (cv) https://arxiv.org/abs/2105.03404 I mean I don't see a strong reason to turn away from attention as well but I also don't think anyone's thrown a billion parameter MLP or Conv model at a problem. We've put a lot of work into attention, transformers, and scaling these. Thousands of papers each year! Definitely don't see that for other architectures. The ResNet Strikes back paper is a great paper for one reason being that it should remind us all to not get lost in the hype and that our advancements are coupled. We learned a lot of training techniques since the original ResNet days and pushing those to ResNets also makes them a lot better and really closes the gaps. At least in vision (where I research). It is easy to railroad in research where we have publish or perish and hype driven reviewing. |
|