| HN Mirror

Probably the "it" is whatever one model has that other models don't have. When everyone is using the same architecture, then the data makes the difference. If everyone has the same data, then the architecture makes the difference.

It sounds pretty obvious to say that the difference is whatever is different, but isn't that literally what both sides of this argument are saying?

edit: I do think that what the original linked essay is saying is slightly subtler than that, which is that _given_ that everyone is using the same transformer architecture, the exact hyperparameters and fine tuning that is done matters a lot less than the data set does.