by thewataccount
1137 days ago
It's from this paper: https://arxiv.org/abs/2203.15556

Like the sibling comment said, the ratio of training tokens to parameter count is very important, and a model needs to reach a certain ratio before it counts as "fully trained".

Usually you have a fixed amount of compute (essentially budget/time). In that case you want to pick the largest parameter count that you can fully train, not the largest parameter count your hardware can support, which you would then train for less time.

tl;dr - Small models trained past the Chinchilla threshold can outperform larger models that are undertrained.

EDIT: Figure 2 (page 5) and Table 3 (page 8) might be worth checking out.
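To make the fixed-compute tradeoff concrete, here's a rough back-of-the-envelope sketch. It leans on two things I didn't spell out above, so treat them as assumptions: the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), and the often-quoted rule of thumb of roughly 20 training tokens per parameter. The function names are made up for illustration, and the exact ratio depends on which fit from the paper you use.

```python
import math

# Assumption: training FLOPs ~= 6 * params * tokens (a standard approximation).
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: ~20 tokens per parameter as a rough "fully trained" rule of thumb.
TOKENS_PER_PARAM = 20.0


def compute_optimal(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Return (params, tokens) that spend the budget at the target ratio.

    With C = 6 * N * D and D = r * N, solving for N gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def tokens_affordable(flops_budget, n_params):
    """Return how many tokens a model of n_params can see within the budget."""
    return flops_budget / (FLOPS_PER_PARAM_TOKEN * n_params)


if __name__ == "__main__":
    budget = 5.88e23  # hypothetical FLOP budget, chosen only for illustration

    n, d = compute_optimal(budget)
    print(f"balanced: {n:.3g} params, {d:.3g} tokens")

    # Same budget, but a model 4x larger: it gets far fewer tokens per parameter.
    big = 4 * n
    d_big = tokens_affordable(budget, big)
    print(f"oversized: {big:.3g} params, {d_big:.3g} tokens "
          f"({d_big / big:.3g} tokens/param)")
```

The point of the second call is just that, for the same budget, picking a model four times bigger leaves it with a small fraction of the tokens per parameter, which is the undertrained case.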