by zone411
1277 days ago
It doesn't work like this. A 1T-parameter model with no architectural changes would not perform substantially better unless it were also trained on a lot more code. The original Codex was trained on 100B tokens, so you could possibly get some gains by increasing the model size, but only up to a point. See the Chinchilla paper for reference.

Also, even under the Chinchilla scaling laws, a larger model still gains performance. It needs a lot more data (if just as noisy) to reach the same level of convergence, but given the same amount of data it will already have partially converged to a better model than the smaller one.
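
To put rough numbers on both points, here is a minimal sketch using the parametric loss fit reported in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta. The constants are the ones fitted there on general text, so applying them to a code model is an assumption, and the model sizes below are purely illustrative.

```python
# Parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022).
# Assumption: these constants were fitted on general text, not on code,
# so treat the outputs as a qualitative illustration only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Estimated loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def data_limited_floor(n_tokens: float) -> float:
    """Loss that no model size can beat at a fixed token budget (n_params -> infinity)."""
    return E + B / n_tokens**BETA


TOKENS = 100e9  # the 100B-token budget mentioned above

# Illustrative model sizes (hypothetical, not actual Codex configurations).
for n_params in (12e9, 175e9, 1e12):
    loss = chinchilla_loss(n_params, TOKENS)
    print(f"{n_params:.0e} params, {TOKENS:.0e} tokens: loss ~ {loss:.3f}")

print(f"floor at {TOKENS:.0e} tokens: {data_limited_floor(TOKENS):.3f}")
```

Under this fit, each step up in size lowers the loss at a fixed 100B tokens, but by less each time, and the B/D^beta term sets a floor that only more data can move.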