by magicalhippo 343 days ago

> Maybe building synthetic questions and answers around the dataset yields better results but I didn't have time to experiment with that approach.

They answer a slightly different question in Physics of Language Models[1], but based on their results it seems likely to me that you do need that kind of dataset augmentation to get good results.

However, they also show that the dataset the base model is trained on can drastically affect finetuning performance. So if the base model is trained on a poor dataset for your specific task, perhaps you'll never get good performance.
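
To make the augmentation idea a bit more concrete, here is a rough sketch of what I have in mind, not something taken from the paper. It turns each fact into several differently worded question/answer pairs using fixed templates. The templates, the field names (`entity`, `attr`, `value`), the `augment` helper and the sample record are all made up for illustration; in practice you would probably have an LLM write the rephrasings instead.

```python
import random

# Hypothetical templates: each is a (question, answer) pair of format strings.
# The wording is illustrative only and not taken from any paper or library.
QA_TEMPLATES = [
    ("What is the {attr} of {entity}?", "{value}"),
    ("Tell me {entity}'s {attr}.", "{entity}'s {attr} is {value}."),
    ("{entity}: {attr}?", "{value}"),
]


def augment(facts, n_per_fact=2, seed=0):
    """Return n_per_fact synthetic question/answer pairs for each fact."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    pairs = []
    for fact in facts:
        # Sample without replacement so one fact gets distinct phrasings.
        for q_tmpl, a_tmpl in rng.sample(QA_TEMPLATES, k=n_per_fact):
            pairs.append({
                "question": q_tmpl.format(**fact),
                "answer": a_tmpl.format(**fact),
            })
    return pairs


# Invented toy record, standing in for one row of the real dataset.
facts = [{"entity": "Acme Widget", "attr": "weight", "value": "2 kg"}]
pairs = augment(facts)
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```

The point is just that the same fact shows up in the finetuning data under several phrasings rather than once.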

[1]: https://physics.allen-zhu.com/part-3-knowledge/part-3-1