|
|
|
|
|
by Filligree
1314 days ago
|
|
Can’t provide a reference, but I can confirm that this is common knowledge. It’s why e.g. GPT-3 outperforms GPT-2. Though as stable diffusion shows, network architecture still matters a lot! Note that the article points out you’ll get more overfitting as your number or parameters approaches that of the training set, which is what I suspect you’ve seen. The trend does reverse later on, but only once the parameter count is orders of magnitude beyond that point, and I don’t know if that ever happens outside of ML. It’s a lot of parameters. |
|