| HN Mirror

> However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible.

That doesn't make any sense when you're dealing with a model which is so hugely over-parameterized. The model will learn the easy data that you are removing just fine. There's no 'limited data' there.

> If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.

You can't interpret the loss qualitatively anyway. It's totally dependent on the details of tokenization, formatting, corpus size, etc. You still have to look at the samples or a downstream task to see if it's working well. Even quantitatively, the loss is only meaningful if you're comparing to a heldout sample or something, and then it doesn't matter if you were screwing with it like OP.