| It's curse and a blessing that discussion of topics happens in so many different places. I found this comment on Twitter/X interesting: https://x.com/fchollet/status/1841902521717293273 "Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning) Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime." |
I have almost the opposite take. We've had a lot of datasets for ages, but all the progress in the last decade has come from advances how curves are architected and fit to the dataset (including applying more computing power).
Maybe there's some theoretical sense in which older models could have solved newer problems just as well if only we applied 1000000x the computing power, so the new models are 'just' an optimisation, but that's like dismissing the importance of complexity analysis in algorithm design, and thus insisting that bogosort and quicksort are equivalent.
When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.