Hacker News
by AaronFriel 1154 days ago
The performance report doesn't describe the loss reached by each of these fine-tuning runs, but I wonder if the number of tokens in the instruction dataset was just not nearly large enough to produce high-quality output.

I can't think of any other reason the 13B-parameter model would perform worse than the 7B model. Would love to see a deep dive into the fine-tuning and more details on the output, by epoch if possible.

1 comment

I have seen this same phenomenon mentioned on Hugging Face: a fine-tuned large model performing worse than its smaller variant.