
OK, so this intuition is actually a bit hard to unpack; I got it from bits and pieces. One piece is this post: https://www.fast.ai/posts/2023-09-04-learning-jumps/. Essentially, a single pass over the training data is enough for the LLM to significantly "learn" the material. In fact, if you read the LLM training papers, for the very large models they generally say explicitly that they only did 1 pass over the training corpus, and sometimes not even the full corpus, only something like 80% of it. The other relevant information is the loss curves: models like Llama 3 are not trained until the loss on the training data is minimized, the way typical ML models are. Rather, the training budget is set using approximate estimates of FLOPs / tokens vs. performance on benchmarks. But it is pretty much guaranteed that if you continued to train on the training data, the model would continue to improve its fit; 1 pass over the training data is by no means enough to adequately learn all of the patterns. So from a compression standpoint, the paper I linked previously says that an LLM is a great compressor, but it's not even fully tuned, hence "not trained to saturation".
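To make "not trained to saturation" concrete, here is a toy sketch. It is not an LLM, just a two-parameter linear model fit with plain SGD, and every name in it (`make_data`, `sgd_pass`, `train`, the learning rate, the data) is made up for illustration. The only point is that the training loss after 1 pass is still well above what you get by continuing to pass over the same data.

```python
import random


def make_data(n, seed=0):
    """Noise-free toy data from y = 3x + 1 (an arbitrary, made-up target)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 1.0) for x in xs]


def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def sgd_pass(w, b, data, lr):
    """One pass (epoch) of per-example SGD over the data."""
    for x, y in data:
        err = w * x + b - y
        w -= lr * 2 * err * x
        b -= lr * 2 * err
    return w, b


def train(data, passes, lr=0.01):
    """Train from zero and record the training loss after each pass."""
    w, b = 0.0, 0.0
    losses = []
    for _ in range(passes):
        w, b = sgd_pass(w, b, data, lr)
        losses.append(mse(w, b, data))
    return losses


data = make_data(50)
losses = train(data, passes=20)
print(f"training loss after 1 pass:    {losses[0]:.4f}")
print(f"training loss after 20 passes: {losses[-1]:.6f}")
```

A real LLM obviously has far more capacity and far messier data, so treat this as an analogy for the shape of the argument, not as evidence about any particular model.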

Now, as far as how fine-tuning affects model performance, it is pretty simple: it improves fit on the fine-tuning data and decreases fit on the original training corpus. Beyond that, yeah, it is hard to say whether fine-tuning will help you solve your problem. My experience has been that it always hurts generalization, so if you aren't getting reasonable results with a base or chat-tuned model, fine-tuning further will not help; but if you are getting results, fine-tuning will make them more consistent.
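The trade-off can be sketched with the same kind of toy setup: "pretrain" a tiny linear model on one dataset, then "fine-tune" it on a smaller dataset drawn from a slightly different line. Everything here (the two datasets, `sgd`, the learning rate, the number of passes) is a hypothetical stand-in chosen for illustration, not anything from a real training recipe.

```python
import random


def make_data(slope, intercept, n, seed):
    """Noise-free toy data from y = slope*x + intercept."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x + intercept) for x in xs]


def mse(params, data):
    """Mean squared error of the line y = w*x + b on the given data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def sgd(params, data, passes, lr=0.05):
    """Per-example SGD for a number of passes, starting from params."""
    w, b = params
    for _ in range(passes):
        for x, y in data:
            err = w * x + b - y
            w -= lr * 2 * err * x
            b -= lr * 2 * err
    return (w, b)


# Stand-in for the big original corpus and a small fine-tuning set.
pretrain_data = make_data(3.0, 1.0, n=200, seed=0)
finetune_data = make_data(2.0, 1.5, n=20, seed=1)

base = sgd((0.0, 0.0), pretrain_data, passes=5)
tuned = sgd(base, finetune_data, passes=5)

for name, params in [("base", base), ("tuned", tuned)]:
    print(
        f"{name:>5}: loss on original data = {mse(params, pretrain_data):.4f}, "
        f"loss on fine-tuning data = {mse(params, finetune_data):.4f}"
    )
```

In this toy the tuned model fits the fine-tuning data better and the original data worse, which is the first-order effect described above. It says nothing about generalization to new tasks; that part is my experience, not something a two-parameter model can show.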