by ACCount37 82 days ago
No. There's no "answer" really.

They use self-distillation to shift the model's output distribution towards that of the same model run with different temperature/truncation settings at sampling time.

This effectively "folds" the logit tail truncation behavior into the model itself.
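A rough sketch of what that could look like for a single token position. The thread doesn't say which truncation rule or loss they use, so the top-p cutoff, the temperature of 0.7, the KL objective and every function name here are my assumptions, not their method:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def truncated_target(logits, temperature=0.7, top_p=0.9):
    # "Teacher" distribution: the same logits, but sampled the way the
    # decoder would sample them (temperature + top-p), then renormalized.
    # Both settings are assumed values for illustration.
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    target = [0.0] * len(probs)
    for i in kept:
        target[i] = probs[i] / mass
    return target

def kl_divergence(target, student):
    # KL(target || student); terms with zero target mass contribute nothing.
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0.0)

logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # made-up logits for one position
target = truncated_target(logits)       # truncated, sharpened teacher
student = softmax(logits)               # plain T=1 output of the same model
loss = kl_divergence(target, student)   # training signal that "folds in" the truncation
```

Minimizing that loss pushes the plain T=1 distribution towards one that already has the tail cut off, so the truncation no longer has to happen in the sampler.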

In effect it's not entirely unlike a few "model-controlled sampling settings" approaches I've seen, but the execution is different.

1 comment

Isn't that "scheduled sampling"? In that case they also shift the input distribution toward that of the model, which is possibly even more crucial than shifting the output distribution.