| Yi Tay's response (chief scientist at Reka AI, ex-Google Brain researcher): https://twitter.com/YiTayML/status/1783273130087289021 >not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you'll feed it you'll still be lacking behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP. >on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say "architecture research" does not matter and "makes no difference". i hear this a lot about how people use this to justify not innovating at the architecture level. >the truth is the community stands on the shoulder of giants of all the arch research that have been done to push the transformer to this state today. >architecture research matters. many people just take it for granted these days. |
Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household but one has a severe disability.
What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.
:edit: Perhaps it'd be best to give a specific example: Lets say you train two pairs of networks: (1) A Mamba SSM and a Transformer on the Pile. (2) Two transformers, one trained on the Pile, the other trained on Reddit comments. All are trained to the same MMLU performance.
I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.