|
|
|
|
|
by l5870uoo9y
82 days ago
|
|
> Our method, simple self-distillation (SSD), is embarrassingly simple: sample solutions from the base model with specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss. So you prompt the base model for answer and then rerun the prompt with the answer from the first run? |
|
They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.
This effectively "folds" the logit tail truncation behavior into the model itself.
Not entirely unlike a few "model controlled sampling settings" things I've seen in what it does, but different in execution.