by furiousteabag 933 days ago
I agree that usually 'more is more' for training LLMs. However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible. Since the model still encounters these masked sentences in the data, it effectively learns to respond based on the speaker's name. So, complicating the task might not be necessary. Also, I'm concerned about interpreting the loss value. If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.
1 comment

> However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible.

That doesn't make any sense when you're dealing with a model which is so hugely over-parameterized. The model will learn the easy data that you are removing just fine. There's no 'limited data' there.
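To make the masking question concrete, here is a minimal sketch of what "removing the easy data" does to the number being optimized. The per-token losses and the mask are made-up values for illustration, and `mean_nll` is a hypothetical helper, not anything from OP's code. Masked tokens stay in the context; they just stop contributing to the average.

```python
def mean_nll(token_nlls, loss_mask=None):
    """Average per-token negative log-likelihood over positions where the mask is 1.

    With loss_mask=None every token counts, which is plain next-token training.
    """
    if loss_mask is None:
        loss_mask = [1] * len(token_nlls)
    if len(loss_mask) != len(token_nlls):
        raise ValueError("mask and losses must have the same length")
    kept = [nll for nll, keep in zip(token_nlls, loss_mask) if keep]
    if not kept:
        raise ValueError("mask removes every token")
    return sum(kept) / len(kept)


# Hypothetical per-token losses (in nats) for one chat turn:
# three predictable tokens (speaker name, boilerplate), then three hard ones.
token_nlls = [0.1, 0.1, 0.2, 2.5, 3.0, 2.8]
loss_mask = [0, 0, 0, 1, 1, 1]  # assumed mask: train only on the reply tokens

full_loss = mean_nll(token_nlls)               # all tokens contribute
masked_loss = mean_nll(token_nlls, loss_mask)  # easy tokens excluded

print(f"unmasked: {full_loss:.3f}  masked: {masked_loss:.3f}")
```

The masked number is higher only because the cheap tokens no longer dilute the average. The two values are on different scales, so neither one tells you by itself which setup trains a better model.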

> If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.

You can't interpret the loss qualitatively anyway. It's totally dependent on the details of tokenization, formatting, corpus size, etc. You still have to look at the samples or a downstream task to see if it's working well. Even quantitatively, the loss is only meaningful if you're comparing to a held-out sample or something, and then it doesn't matter if you were screwing with it like OP.
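A rough sketch of the tokenization point, under made-up numbers: the same text with the same total log-likelihood gives different per-token losses depending on how many tokens it is split into. Converting to bits per character is one common way to put such numbers on a shared scale. `bits_per_char` is a hypothetical helper and the token counts are assumptions, not measurements.

```python
import math


def bits_per_char(mean_nll_nats, n_tokens, n_chars):
    """Convert a mean per-token loss in nats into bits per character of raw text."""
    total_nats = mean_nll_nats * n_tokens
    return total_nats / math.log(2) / n_chars


# Assumed setup: one 1000-character held-out text, scored equally well overall
# (700 nats in total) under two hypothetical tokenizers.
n_chars = 1000
total_nats = 700.0

loss_a = total_nats / 250  # tokenizer A: 250 tokens -> 2.80 nats/token
loss_b = total_nats / 400  # tokenizer B: 400 tokens -> 1.75 nats/token

bpc_a = bits_per_char(loss_a, 250, n_chars)
bpc_b = bits_per_char(loss_b, 400, n_chars)

print(f"per-token loss: A={loss_a:.2f}  B={loss_b:.2f}")
print(f"bits per char:  A={bpc_a:.4f}  B={bpc_b:.4f}")
```

The per-token losses differ a lot while the per-character figure is identical, which is why a raw loss value only means something relative to another run scored on the same held-out text with the same tokenization and formatting.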