by sitic 1189 days ago
The LLaMA paper contradicts this view: "[...] Although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens." https://arxiv.org/pdf/2302.13971.pdf
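
For scale, the figures in that quote work out to about 20 tokens per parameter for the Hoffmann et al. recommendation, versus roughly 143 for the 7B model at 1T tokens. A quick sketch of the arithmetic, using only the numbers quoted above (the helper name is my own, not from either paper):

```python
# Tokens-per-parameter ratios implied by the figures in the quote.
# tokens_per_param is a hypothetical helper, not something from either paper.
def tokens_per_param(tokens: float, params: float) -> float:
    return tokens / params

hoffmann = tokens_per_param(200e9, 10e9)  # 10B model on 200B tokens
llama_7b = tokens_per_param(1e12, 7e9)    # 7B model on 1T tokens

print(f"Hoffmann et al. recommendation: {hoffmann:.0f} tokens/param")
print(f"LLaMA 7B at 1T tokens: {llama_7b:.0f} tokens/param")
print(f"Ratio between the two: {llama_7b / hoffmann:.1f}x")
```

So the 7B model was still improving at about seven times the tokens-per-parameter ratio of the quoted recommendation.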