by gwern 933 days ago
> My data collator ensures that the loss is only calculated based on someone’s response. Predicting who will speak next is relatively straightforward, and we don’t want the model to focus on learning that. Therefore, parts of the conversation where the loss is calculated are highlighted in bold.

If it's so easy, then you don't need to remove it. The model will solve it easily and focus on everything else. At best, you save some parameters and compute; at worst, you are damaging its ability to learn important things like conversational skills or modeling people. When it comes to LLMs, more is more, and trying to hand-engineer the dataset or think for the LLM can backfire in very subtle and difficult-to-diagnose ways.
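For concreteness, here is a rough sketch of the two options being compared. It is not OP's actual collator: the `(prefix_ids, response_ids)` turn format, the `build_labels` helper, and the token IDs are all made up for illustration. The only real convention used is `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which marks positions that do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(turns, mask_speaker_prefix):
    """Flatten a conversation into (input_ids, labels).

    turns: list of (prefix_ids, response_ids) pairs, where prefix_ids is the
    tokenized speaker name and response_ids the message (hypothetical format).
    """
    input_ids, labels = [], []
    for prefix_ids, response_ids in turns:
        input_ids.extend(prefix_ids)
        input_ids.extend(response_ids)
        if mask_speaker_prefix:
            # OP's approach: the model still sees the name, but is not trained on it.
            labels.extend([IGNORE_INDEX] * len(prefix_ids))
        else:
            # Plain language modeling: every token is a prediction target.
            labels.extend(prefix_ids)
        labels.extend(response_ids)
    return input_ids, labels


# Toy conversation with invented token IDs.
turns = [([11, 12], [101, 102, 103]), ([21, 22], [201, 202])]
masked_ids, masked_labels = build_labels(turns, mask_speaker_prefix=True)
full_ids, full_labels = build_labels(turns, mask_speaker_prefix=False)
```

The inputs are identical either way; only the set of positions that produce gradient differs, which is why leaving the "easy" tokens in costs little.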

> Ok, it is capable of forming coherent sentences. The most noticeable problem is its lack of awareness regarding the context of the conversations which leads to bland and generic replies. The messages lacked any distinct style, feeling quite basic...
>
> Conversations have become more interesting and engaging, although there’s still a risk of losing context. Russian language performance has improved, but errors still occur. I believe that before fine-tuning for a specific task with limited data, like mine, it would be beneficial to first fine-tune the model unsupervised on a large corpus of Russian texts. Additionally, incorporating common conversation partners’ names as separate tokens might enhance the quality. I wouldn’t say it has turned out to be significantly better than LoRA. It might be more effective to focus solely on a single person and calculate the loss based only on my responses (or someone else’s), instead of trying to learn about each and every conversational partner.

1 comment

I agree that usually 'more is more' for training LLMs. However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible. Since the model still encounters these masked sentences in the data, it effectively learns to respond based on the speaker's name. So, complicating the task might not be necessary. Also, I'm concerned about interpreting the loss value. If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.

> However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible.

That doesn't make any sense when you're dealing with a model which is so hugely over-parameterized. The model will learn the easy data that you are removing just fine. There's no 'limited data' there.

> If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.

You can't interpret the loss qualitatively anyway. It's totally dependent on the details of tokenization, formatting, corpus size, etc. You still have to look at the samples or a downstream task to see if it's working well. Even quantitatively, the loss is only meaningful if you're comparing to a held-out sample or something, and then it doesn't matter if you were screwing with it like OP.
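A toy illustration of why the raw number says so little on its own, assuming made-up per-token log-probabilities and a hypothetical `mean_nll` helper (nothing here comes from OP's code). The same model outputs give a different average loss depending purely on which tokens are counted:

```python
IGNORE_INDEX = -100  # positions with this label are skipped, as in PyTorch's CrossEntropyLoss


def mean_nll(token_logprobs, labels):
    """Mean negative log-likelihood over the positions that are not masked."""
    counted = [-lp for lp, y in zip(token_logprobs, labels) if y != IGNORE_INDEX]
    if not counted:
        raise ValueError("every position is masked")
    return sum(counted) / len(counted)


# Invented log-probs: two near-certain speaker-name tokens, three harder response tokens.
logprobs = [-0.01, -0.02, -2.0, -1.5, -2.5]
labels_full = [11, 12, 101, 102, 103]
labels_masked = [IGNORE_INDEX, IGNORE_INDEX, 101, 102, 103]

loss_full = mean_nll(logprobs, labels_full)      # easy tokens pull the average down
loss_masked = mean_nll(logprobs, labels_masked)  # same predictions, higher average
```

Neither number is "the" loss. A comparison only means something when both runs are scored on the same held-out tokens with the same mask.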