
by thewataccount 1184 days ago

It's from this paper - https://arxiv.org/abs/2203.15556

Like the sibling comment said - the ratio of training tokens to parameter count is very important, and a model has to reach a certain ratio before it counts as "fully trained".

Usually you have a fixed amount of compute (budget/time, essentially). In that case you want to pick the largest parameter count that you can fully train within that budget, not the largest parameter count your hardware can support, which you would then have to train for less time.
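To make the tradeoff concrete, here's a rough sketch of how you might split a fixed compute budget. It leans on two approximations that are not stated in this comment and should be treated as assumptions: training cost of about 6 FLOPs per parameter per token, and the commonly quoted rule of thumb of about 20 training tokens per parameter. The function names are just for illustration.

```python
import math

# Assumption: training cost C ~= 6 * N * D FLOPs
# (N = parameter count, D = training tokens).
FLOPS_PER_PARAM_TOKEN = 6.0

# Assumption: rough rule of thumb of ~20 tokens per parameter.
# This is an approximation, not an exact figure.
TOKENS_PER_PARAM = 20.0


def compute_optimal(flops_budget):
    """Return (params, tokens) that spend the budget at the assumed ratio.

    Solves C = 6 * N * D with D = 20 * N, so N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


def tokens_affordable(flops_budget, n_params):
    """Tokens you can afford for a given model size under the same budget."""
    return flops_budget / (FLOPS_PER_PARAM_TOKEN * n_params)


if __name__ == "__main__":
    budget = 6.0 * 70e9 * 1.4e12  # FLOPs for a 70B model on 1.4T tokens
    n, d = compute_optimal(budget)
    print(f"balanced: {n:.3g} params, {d:.3g} tokens")

    # Same budget, but a 4x larger model: far fewer tokens per parameter.
    big = 4 * n
    ratio = tokens_affordable(budget, big) / big
    print(f"4x larger model: {ratio:.3g} tokens per parameter")
```

With these assumptions, quadrupling the model at the same budget cuts the tokens-per-parameter ratio by 16x, which is the "undertrained large model" case.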

tl;dr - Small models trained past the Chinchilla threshold can outperform larger models that are undertrained

EDIT: Figure 2 page 5, and Table 3 page 8 - might be worth checking out.