by sitic
1189 days ago
The LLaMA paper contradicts this view:
"[...] Although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens."
https://arxiv.org/pdf/2302.13971.pdf