HN Mirror



	by sitic 1189 days ago
	The LLaMA paper contradicts this view: "[...] Although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens." https://arxiv.org/pdf/2302.13971.pdf
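
	To put rough numbers on the quoted sentence, here is a minimal sketch that works out training tokens per parameter for the two configurations it mentions. The tokens-per-parameter framing and the helper name `tokens_per_param` are my own illustration, not something taken from the paper; the only inputs are the figures in the quote.

```python
def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Return training tokens seen per model parameter (hypothetical helper)."""
    return n_tokens / n_params


# Figures below come only from the quoted sentence.
hoffmann = tokens_per_param(10e9, 200e9)  # 10B parameters, 200B tokens
llama_7b = tokens_per_param(7e9, 1e12)    # 7B parameters, 1T tokens

print(f"Hoffmann et al. recommendation: {hoffmann:.0f} tokens per parameter")
print(f"LLaMA 7B at 1T tokens: {llama_7b:.0f} tokens per parameter")
print(f"Ratio between the two: {llama_7b / hoffmann:.1f}x")
```

	On these figures the 7B run is at roughly seven times the tokens-per-parameter ratio of the quoted recommendation, and per the quote it was still improving at that point.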