Hacker News
by sp332 1141 days ago
For training, yes, but these models are optimized for inference, since inference will be run many more times than training. The original Llama models were trained on far more data than the Chinchilla-optimal amount.
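To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes the usual rules of thumb, none of which come from the comment itself: about 20 training tokens per parameter as "Chinchilla-optimal", about 6 * N * D FLOPs to train N parameters on D tokens, and about 2 * N FLOPs per token at inference. The function names are made up for illustration, and the 7B / 1T figures are the commonly cited ones for the smallest original Llama.

```python
# Rough heuristics only; all constants below are approximations, not exact laws.
TOKENS_PER_PARAM = 20  # assumed Chinchilla rule of thumb: ~20 tokens per parameter


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a model of n_params."""
    return TOKENS_PER_PARAM * n_params


def overtraining_ratio(n_params: float, n_tokens: float) -> float:
    """How many times past the Chinchilla-optimal token count the model was trained."""
    return n_tokens / chinchilla_optimal_tokens(n_params)


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def inference_flops(n_params: float, n_tokens_served: float) -> float:
    """Approximate inference cost: ~2 FLOPs per parameter per token served."""
    return 2 * n_params * n_tokens_served


def breakeven_served_tokens(n_train_tokens: float) -> float:
    """Tokens served at which inference cost equals training cost (6ND = 2NT)."""
    return 3 * n_train_tokens


# Smallest original Llama: ~7B parameters, ~1T training tokens.
llama_params = 7e9
llama_tokens = 1e12

ratio = overtraining_ratio(llama_params, llama_tokens)
breakeven = breakeven_served_tokens(llama_tokens)

print(f"Chinchilla-optimal tokens: {chinchilla_optimal_tokens(llama_params):.2e}")
print(f"Trained past optimal by:   {ratio:.1f}x")
print(f"Inference cost overtakes training after ~{breakeven:.1e} served tokens")
```

Under these assumptions the per-token inference cost depends only on parameter count, so spending extra training compute on a smaller model pays off once enough tokens have been served.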