|
|
|
|
|
by imtringued
483 days ago
|
|
This is in fact a big problem with fine-tuning most models. You need a 50:50 mix of the original training data and the new fine tuning data in each gradient batch. Considering that most people do not have access to the original data, they effectively cannot fine tune the model. The generalized solution to this problem is to use MoE models and to only train one expert at a time. |
|
No original weights, there...