Hacker News
by minimaxir 902 days ago
It was fun to follow the public TinyLlama loss curves in near real-time, although it could be frustrating, since the loss curves barely moved down even after an extra trillion tokens: https://wandb.ai/lance777/lightning_logs/reports/metric-trai... (note the log-scaled X-axis)

But they did move down and that's what's important.

There should probably be more aggressive learning rate annealing for models trying to be Chinchilla-optimal instead of just cosine-with-warmup like every other model nowadays.

More aggressive learning rate? Wouldn’t we need more information on why they decided to stop training at this point to conclude that?

My current understanding of the story is, to recap:

- First, the game was to increase model size massively

- For example, GPT-3 had 175B parameters, but less than 0.5T tokens of training data

- Then Chinchilla showed that, for a given compute budget, we can scale better by increasing training data

- Now we have models like this one, and Phi, that were trained on over 1T tokens
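
As a rough back-of-the-envelope check on the recap above, the sketch below compares tokens seen per parameter against the commonly quoted Chinchilla rule of thumb of about 20 tokens per parameter. The rule of thumb is only an approximation, the GPT-3 token count uses the 0.5T upper bound given above, and the TinyLlama figures (1.1B parameters, 3 epochs over roughly 1T tokens) are assumptions for illustration.

```python
# Approximate rule of thumb often attributed to the Chinchilla results.
# This is an assumption for illustration, not an exact law.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


# (parameters, training tokens); all values are approximate assumptions.
models = {
    "GPT-3": (175e9, 0.5e12),     # 0.5T is the upper bound quoted in the thread
    "TinyLlama": (1.1e9, 3e12),   # assumed: 1.1B params, 3 epochs x ~1T tokens
}

ratios = {}
for name, (params, tokens) in models.items():
    tpp = tokens_per_param(params, tokens)
    # How many times the rule-of-thumb token budget the model actually saw.
    ratios[name] = tpp / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {tpp:,.1f} tokens/param, {ratios[name]:.2f}x the rule of thumb")
```

Under these assumptions GPT-3 lands well below the rule of thumb and TinyLlama far above it, which is the shift the recap describes.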

For any model, the loss curve going down could mean it’s learning, or could mean it’s overfitting, we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.

So, getting back to your comment, I thought there were actually a multitude of indicators that should be used, not just validation loss, to determine what would be gained with more training.

Overfitting is quite unlikely with a smaller model though. Model parsimony provides a kind of regularization "for free", in fact with the extra benefit of saving on compute costs.
The dirty secret behind modern self-supervised training is that no one cares about a test/validation dataset anymore.
Does overfitting even matter if your dataset is large enough?
I think a lot of it depends on what you mean by “large enough”.

In principle, a data set could be infinitely large in size, but not cover little edge cases here and there due to repetition. So you might be OK if you had infinite size and infinite diversity.

Even if you had very large finite data, let’s say all language ever conceived by mankind… The second you finish training, what your overfit model knows is locked in.

The world as we know it would continue to generate vast amounts of new data that you might not be able to generalize to.

> For any model, the loss curve going down could mean it’s learning, or could mean it’s overfitting, we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.

You want to look at validation accuracy.

Accuracy is a bad metric for LLMs, especially since an LLM tokenizer can have thousands of "classes": 32,000 in the case of TinyLlama.
I guess it comes down to whether your use case has a single correct answer vs multiple possible ones. For example, a lot of what we do has one and only one correct sequence of tokens. You need to look at both, but so much of the learning material out there just focuses on loss. YMMV.
That is already accounted for with categorical cross-entropy loss.
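
To make the difference between the two metrics concrete, here is a minimal sketch using made-up probability vectors over a 4-token toy vocabulary (not TinyLlama outputs). It computes the per-token cross-entropy and top-1 correctness for each, plus the loss a uniform guess would get over a 32,000-token vocabulary. The function and variable names are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the correct token."""
    return -math.log(probs[target])


def top1_correct(probs, target):
    """True if the highest-probability token is the correct one."""
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted == target


# Loss of a uniform guess over a 32,000-token vocabulary: ln(32000).
VOCAB_SIZE = 32_000
uniform_loss = math.log(VOCAB_SIZE)

# Toy predictions; the correct token is index 0 in every case.
target = 0
confident = [0.90, 0.05, 0.03, 0.02]   # right, and sure about it
hesitant = [0.40, 0.30, 0.20, 0.10]    # right, but barely
near_miss = [0.45, 0.50, 0.03, 0.02]   # wrong, yet close

for name, probs in [("confident", confident), ("hesitant", hesitant), ("near_miss", near_miss)]:
    print(name, top1_correct(probs, target), round(cross_entropy(probs, target), 3))
```

Accuracy treats `confident` and `hesitant` identically and counts `near_miss` as a plain failure, while cross-entropy separates all three and actually scores `near_miss` better than `hesitant`. Whether that extra resolution matters depends on the use case, as the comments above suggest.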
> Wouldn’t we need more information on why they decided to stop training at this point to conclude that?

The experiment was fixed at 3 epochs on 1T tokens; they didn't decide to "stop" at a given criterion.

> we don’t know which without looking at validation loss, which is like a second set of test data the model hasn’t seen before.

The data I linked shows the validation loss, which has the same behavior as the training loss.

I'd love to see someone go for another few epochs in the future. Two of the benchmarks got a significant jump almost at the end of training. I wonder if there's a chance for more of that - looks like an interesting effect on its own.
The jump was due to them fixing a bug. There’s a footnote about it on the bottom of page 5.

In the Discord, they mentioned a TinyLlama v2; presumably that would have this bug (and another bug, see the footnote on page 4) fixed.

How crucial is it to freeze the learning rate schedule a priori, instead of tweaking it on the fly?
Constant learning rates were the default in older ML implementations, but linear decay became an obvious optimization, and now we have both warmup and cosine decay to handle common training patterns, especially with the AdamW optimizer.

If the learning rate is too high at a given point in training, it can result in either a) the model no longer learning or b) exploding gradients, which is very bad.
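
For readers who have not seen it written out, a warmup-plus-cosine schedule fixed ahead of time can be sketched as a pure function of the step number. This is a minimal illustration only: the function name and every hyperparameter value below are hypothetical and are not taken from TinyLlama's configuration.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


# Hypothetical values, chosen only to show the shape of the curve.
MAX_LR, MIN_LR = 4e-4, 4e-5
WARMUP, TOTAL = 2_000, 100_000

for step in (0, 1_000, 2_000, 51_000, 100_000):
    print(step, f"{lr_at(step, MAX_LR, MIN_LR, WARMUP, TOTAL):.2e}")
```

Because the whole curve depends on `total_steps`, the schedule has to be decided before training starts, which is the constraint the question above is asking about.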

Adaptive learning rate is a thing. For example, one scheme I've used before is to decrease the learning rate if the validation loss stops decreasing.

It's not clear to me if this is applicable to LLMs though.
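
The scheme described above, cutting the learning rate when validation loss stops improving, can be sketched in a few lines. This is a simplified illustration with hypothetical names and defaults, similar in spirit to PyTorch's `ReduceLROnPlateau` scheduler but not a reimplementation of it, and it says nothing about whether the approach works well for LLM pretraining.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after too many non-improving evals."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation loss and return the (possibly reduced) rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals > self.patience:
                # Plateau detected: decay the rate, but never below min_lr.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_evals = 0
        return self.lr


# Made-up validation losses: improvement, a plateau, then improvement again.
scheduler = ReduceOnPlateau(lr=0.1)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]]
print(history)
```

With `patience=2`, the rate is halved on the third consecutive evaluation that fails to beat the best loss seen so far.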