It is very counterintuitive. It is also a very common observation that has taken everybody by surprise for almost two decades now. At the beginning, people were very resistant to the idea, even when every experiment confirmed it.
The catch is that you need a huge amount of data to train those.
It also seems to have limits. There have been a few well-documented cases where current huge, very well-trained networks reached error rates that were lower than the rate of mislabeling in the data.
Can’t provide a reference, but I can confirm that this is common knowledge. It’s why e.g. GPT-3 outperforms GPT-2.
Though as Stable Diffusion shows, network architecture still matters a lot!
Note that the article points out you’ll get more overfitting as your number of parameters approaches the size of the training set, which is what I suspect you’ve seen. The trend does reverse later on, but only once the parameter count is orders of magnitude beyond that point, and I don’t know if that ever happens outside of ML. It’s a lot of parameters.
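If you want to poke at this yourself, here is a minimal sketch on a toy problem. Everything in it is my own assumption rather than something from the article: the function name `double_descent_curve`, the synthetic Gaussian linear data, and all the sizes and noise levels. It fits least squares on the first `p` of 200 features with 40 training points, using the minimum-norm solution once `p` exceeds the number of training points.

```python
import numpy as np


def double_descent_curve(n_train=40, n_test=2000, total_dim=200,
                         widths=(5, 20, 40, 80, 200), noise=0.5, seed=0):
    """Return a list of (p, train_mse, test_mse) for least squares on the first p features."""
    rng = np.random.default_rng(seed)

    # Hypothetical data: y = x . beta + Gaussian noise, with unit-norm beta.
    beta = rng.standard_normal(total_dim)
    beta /= np.linalg.norm(beta)
    x_train = rng.standard_normal((n_train, total_dim))
    x_test = rng.standard_normal((n_test, total_dim))
    y_train = x_train @ beta + noise * rng.standard_normal(n_train)
    y_test = x_test @ beta + noise * rng.standard_normal(n_test)

    rows = []
    for p in widths:
        # lstsq gives the ordinary least-squares fit when p < n_train and the
        # minimum-norm interpolating solution when p > n_train.
        coef, *_ = np.linalg.lstsq(x_train[:, :p], y_train, rcond=None)
        train_mse = float(np.mean((x_train[:, :p] @ coef - y_train) ** 2))
        test_mse = float(np.mean((x_test[:, :p] @ coef - y_test) ** 2))
        rows.append((p, train_mse, test_mse))
    return rows


if __name__ == "__main__":
    for p, train_mse, test_mse in double_descent_curve():
        print(f"p={p:4d}  train_mse={train_mse:.3g}  test_mse={test_mse:.3g}")
```

In a setup like this, the training error drops to essentially zero once `p` passes the number of training points, and the test error typically spikes near `p = n_train` before coming back down. How far it comes down depends a lot on the noise level and the sizes you pick, so treat this as an illustration, not a benchmark.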
Do you have some references for this claim? For me, it seems counterintuitive.