by WhitneyLand 902 days ago
More aggressive learning rate? Wouldn’t we need more information on why they decided to stop training at this point to conclude that?

My current understanding of the story is, to recap:

- First the game was to increase model size massively

- For example, GPT-3 had 175B parameters but less than 0.5T tokens of training data

- Then Chinchilla showed for a given compute budget we can scale better by increasing training data

- Now we have models like this, and Phi, that have over 1T trained tokens
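
To put rough numbers on the recap above, here is a small sketch of the tokens-per-parameter arithmetic. The GPT-3 entry uses the 0.5T upper bound from the list above, so its true ratio is lower. The Chinchilla figures (70B parameters, 1.4T tokens) are not from this thread and are added as an assumption for comparison. The function and dictionary names are made up for illustration.

```python
# Tokens-per-parameter ratios for the models discussed above.
# GPT-3 uses the 0.5T upper bound quoted in the comment, so its real ratio is lower.
# The Chinchilla figures (70B parameters, 1.4T tokens) are an added assumption.

def tokens_per_param(params: float, tokens: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return tokens / params

models = {
    "GPT-3 (upper bound)": (175e9, 0.5e12),
    "Chinchilla": (70e9, 1.4e12),
}

ratios = {name: tokens_per_param(p, t) for name, (p, t) in models.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

Under these assumptions, GPT-3 saw fewer than 3 tokens per parameter and Chinchilla about 20, which is the shift toward more data that the list describes.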

For any model, the loss curve going down could mean it’s learning, or could mean it’s overfitting; we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.
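
A minimal sketch of that check, assuming you have logged training and validation loss at the same steps. The `diagnose` function, its `window` parameter, and the loss values are all made up for illustration; real runs are noisier and would need smoothing.

```python
def diagnose(train_losses, val_losses, window=3):
    """Compare the recent trend of training and validation loss.

    Returns "learning" if both are falling, "overfitting" if training loss
    falls while validation loss rises, and "unclear" otherwise.
    """
    if len(train_losses) != len(val_losses) or len(train_losses) <= window:
        raise ValueError("need two equal-length histories longer than the window")
    # Change in each loss over the last `window` logged steps.
    train_delta = train_losses[-1] - train_losses[-1 - window]
    val_delta = val_losses[-1] - val_losses[-1 - window]
    if train_delta < 0 and val_delta < 0:
        return "learning"
    if train_delta < 0 and val_delta > 0:
        return "overfitting"
    return "unclear"

# Made-up loss histories, for illustration only.
healthy = diagnose([3.0, 2.6, 2.3, 2.1, 2.0], [3.1, 2.8, 2.5, 2.3, 2.2])
overfit = diagnose([3.0, 2.6, 2.3, 2.1, 2.0], [3.1, 2.8, 2.7, 2.8, 2.9])
print(healthy, overfit)
```

The training curve is identical in both calls; only the validation curve tells the two cases apart.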

So, getting back to your comment, I thought there were actually a multitude of indicators that should be used, not just validation loss, to determine what would be gained with more training.


Overfitting is quite unlikely with a smaller model though. Model parsimony provides a kind of regularization "for free", in fact with the extra benefit of saving on compute costs.
The dirty secret behind modern self-supervised training is that no one cares about a test/validation dataset anymore.
Does overfitting even matter if your dataset is large enough?
I think a lot of it depends on what you mean by “large enough”.

In principle, a data set could be infinitely large in size, but not cover little edge cases here and there due to repetition. So you might be OK if you had infinite size and infinite diversity.

Even if you had very large finite data, let’s say all language ever conceived by mankind… The second you finish training, what your overfit model knows is locked in.

The world as we know it would continue to generate vast amounts of new data that you might not be able to generalize to.

> For any model, the loss curve going down could mean it’s learning, or could mean it’s overfitting; we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.

You want to look at validation accuracy.

Accuracy is a bad metric for LLMs, especially since an LLM tokenizer can have thousands of "classes": 32,000 in the case of TinyLlama.
I guess it comes down to whether your use case has a single correct answer vs multiple possible ones. For example, a lot of what we do has one and only one correct sequence of tokens. Need to look at both, but so much of the learning material out there just focuses on loss. YMMV.
That is already accounted for with categorical cross-entropy loss.
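
A rough sketch of the difference between the two metrics, using the 32,000-token vocabulary mentioned above. The probabilities are hypothetical, and `cross_entropy` here is a hand-written per-token helper, not a library call.

```python
import math

VOCAB_SIZE = 32_000  # TinyLlama vocabulary size, as stated in the thread

def cross_entropy(prob_of_correct_token: float) -> float:
    """Categorical cross-entropy for one prediction: -ln p(correct token)."""
    return -math.log(prob_of_correct_token)

# A model that guesses uniformly over the whole vocabulary.
uniform_loss = cross_entropy(1 / VOCAB_SIZE)

# Two hypothetical predictions that both rank the correct token first
# (so top-1 accuracy is identical) but with very different confidence.
confident_loss = cross_entropy(0.90)
hesitant_loss = cross_entropy(0.10)

print(f"uniform guess: {uniform_loss:.2f} nats")
print(f"confident hit: {confident_loss:.2f} nats")
print(f"hesitant hit:  {hesitant_loss:.2f} nats")
```

Both hits count the same toward accuracy, while the loss separates them; that is one reason loss is the usual signal during training, even if accuracy matters for a task with a single correct token sequence.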
> Wouldn’t we need more information on why they decided to stop training at this point to conclude that?

The experiment was fixed at 3 epochs on 1T tokens; they didn't decide to "stop" at a given criterion.

> we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.

The data I linked shows the validation loss, which has the same behavior as the training loss.

I'd love to see someone go for another few epochs in the future. Two of the benchmarks got a significant jump almost at the end of training. I wonder if there's a chance for more of that - looks like an interesting effect on its own.
The jump was due to them fixing a bug. There’s a footnote about it at the bottom of page 5.

In the Discord, they mentioned a TinyLlama v2; presumably that would have this bug (and another bug, footnote page 4) fixed.