|
Yes, and note that in terms of different architectures, the author (James Betker) is talking about image generators, while when he's talking about LLMs they are all the same basic architecture - transformers. Some tasks are going to be easier to learn that others, and certainly in general you can have more than one architecture capable of learning a given task, as long as it is sufficiently powerful (combination of architecture + size), and well trained. That said, it's notable that all the Pareto optimal LLMs are transformer-based, and that in the 7 years since the attention paper (2017), all we have seen in terms of architectural change have been scaling up or minor tweaks like MoE and different types of attention. How do you make a different architecture such as Mamba more competitive with transformers? Add some transformer layers to it (Jamba) ! So, yeah, as far as LLMs go, the precise model doesn't matter as long as it's a transformer, which isn't very surprising given what we know about how they work - primarily via induction heads. The lesson here isn't that architecture doesn't matter for LLMs, but rather that the architecture has to be a transformer! Data then becomes paramount, because the model learns the program (induction heads, etc) that runs on the machine (transformer) from the data. No doubt there will be architectural advances beyond transformers, although few people seem to be currently looking for them, but I'm pretty sure they will still need something equivalent to the transformer's attention mechanism. |