by peaslock
1277 days ago
> if you want improved performance, you still need more data

Not true. See Figure 2: https://arxiv.org/pdf/2203.15556.pdf#page=5

At the same compute budget, the loss decreases with greater model size (i.e. with training stopped sooner, on less data).

Also, some rehearsal/multi-epoch training reduces forgetting (thereby improving performance substantially), which the Chinchilla paper (Hoffmann et al.) doesn't take into account because they train for less than one epoch. https://arxiv.org/abs/2205.12393
Their text about Figure 3 confirms what I'm saying: "We find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train".
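
For anyone who wants to see that valley numerically, here is a rough sketch. It assumes the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta with the fitted constants as I recall them from the paper, plus the common approximation that training compute is about 6 * N * D FLOPs. The function names and the size grid are mine, purely for illustration, and none of this reproduces the paper's actual training runs.

```python
# Parametric loss, with constants as recalled from the Chinchilla paper's fit
# (treat them as assumptions and check against the paper before relying on them).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    """Estimated loss for a model with n_params parameters trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_curve(flops, sizes):
    """Loss at each model size when the FLOP budget is held fixed.

    Assumes compute ~= 6 * N * D, so a bigger model sees fewer tokens.
    """
    return [(n, loss(n, flops / (6 * n))) for n in sizes]


def best_size(flops, sizes):
    """Model size with the lowest estimated loss on the fixed-budget curve."""
    return min(isoflop_curve(flops, sizes), key=lambda point: point[1])[0]


# Hypothetical grid: 1e8 to 1e12 parameters in quarter-decade steps.
sizes = [10 ** (8 + 0.25 * i) for i in range(17)]

for n, l in isoflop_curve(1e21, sizes):
    print(f"N = {n:.2e}  estimated loss = {l:.3f}")
print(f"lowest loss at N = {best_size(1e21, sizes):.2e}")
```

Under these assumptions the lowest loss for a fixed budget sits at an intermediate size: both the smallest and the largest model on the grid do worse than something in between.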