| A more aggressive learning rate? Wouldn’t we need more information on why they decided to stop training at this point to conclude that? My current understanding of the story, to recap: - First, the game was to increase model size massively - For example, GPT-3 had 175B parameters but fewer than 0.5T tokens of training data - Then Chinchilla showed that, for a given compute budget, we can scale better by increasing the amount of training data - Now we have models like this one, and Phi, that were trained on over 1T tokens. For any model, the training loss going down could mean it’s learning or could mean it’s overfitting; we can’t tell which without looking at validation loss, which is measured on a held-out set of data the model hasn’t seen during training. So, getting back to your comment, I thought there were actually a multitude of indicators that should be used, not just validation loss, to determine what would be gained with more training. |
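
To make the train-versus-validation point concrete, here is a minimal Python sketch of the comparison I have in mind. The function name `diagnose`, the `patience` parameter, and the loss values are all made up for illustration; they are not from any real training run, and a real decision would look at more indicators than this.

```python
def diagnose(train_loss, val_loss, patience=2):
    """Label a run by comparing the tails of its two loss curves.

    Hypothetical helper: `patience` is how many evaluations back we
    look when deciding whether a curve is still falling.
    """
    if len(train_loss) != len(val_loss) or len(val_loss) <= patience:
        raise ValueError("need equal-length curves longer than `patience`")

    # A curve counts as "falling" if its latest value is below the value
    # recorded `patience` evaluations earlier.
    train_falling = train_loss[-1] < train_loss[-1 - patience]
    val_falling = val_loss[-1] < val_loss[-1 - patience]

    if train_falling and val_falling:
        return "still learning"
    if train_falling and not val_falling:
        # Training loss improves while held-out loss does not.
        return "possible overfitting"
    return "plateaued"


# Illustrative numbers only (assumed, not measured).
print(diagnose([3.0, 2.5, 2.2, 2.0], [3.1, 2.7, 2.5, 2.4]))
print(diagnose([3.0, 2.5, 2.2, 2.0], [3.1, 2.7, 2.8, 2.9]))
```

The training curve is identical in both calls; only the validation curve separates the two cases, which is why training loss alone can't settle the question.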