by AaronFriel
1154 days ago
The performance report doesn't describe the loss reached by each of these fine-tuning runs, but I wonder if the instruction dataset simply didn't contain enough tokens to produce high-quality output. I can't think of any other reason the 13B parameter model would perform worse than the 7B model. Would love to see a deep dive into the fine-tuning and more details, by epoch if possible, on the output.