by EvgeniyZh
1154 days ago
Even if you want the best score overall, the Chinchilla scaling laws still apply. Any model is trained with a finite amount of compute, and for that amount of compute there is an optimal model size (optimal in the sense of minimal loss). So the difference between 1 and 2 is basically just the amount of compute. As for inference, if you only want an upper bound on model size, take the largest model you can afford to serve and train it for as long as possible. There is no evidence (yet) that this approach hits a ceiling.
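To make the compute argument concrete, here is a rough sketch of both cases. It relies on two commonly quoted rules of thumb that are assumptions here, not exact results: training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens, and a compute-optimal ratio of about 20 tokens per parameter. The function names are made up for illustration.

```python
import math

# Assumption: training cost is roughly 6 FLOPs per parameter per token.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: compute-optimal training uses about 20 tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal(compute_flops):
    """Approximate compute-optimal (params, tokens) for a FLOP budget.

    Solves C = 6 * N * D with D = 20 * N, so N = sqrt(C / 120).
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


def tokens_for_fixed_size(params, compute_flops):
    """Tokens you can train on when model size is capped (e.g. by inference cost)."""
    if params <= 0 or compute_flops <= 0:
        raise ValueError("params and compute_flops must be positive")
    return compute_flops / (FLOPS_PER_PARAM_TOKEN * params)


if __name__ == "__main__":
    budget = 5.88e23  # FLOPs, an illustrative budget
    n, d = compute_optimal(budget)
    print(f"compute-optimal: {n:.3g} params, {d:.3g} tokens")
    # Inference-bound case: cap the model at 7e9 params, spend the same compute.
    print(f"7e9-param model: {tokens_for_fixed_size(7e9, budget):.3g} tokens")
```

Under these assumptions a budget of 5.88e23 FLOPs works out to roughly 70B parameters and 1.4T tokens. In the inference-bound case you fix the size first and the same budget simply buys more tokens, which is the "train for as long as possible" option.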