| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ACCount37 82 days ago

No. There's no "answer" really.

They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.

This effectively "folds" the logit tail truncation behavior into the model itself.

Not entirely unlike a few "model controlled sampling settings" things I've seen in what it does, but different in execution.

1 comments

3rd3 79 days ago

Isn't that "scheduled sampling"? In that case they also shift the input distribution toward that of the model, which possibly is even more crucial than shifting the output distribution?

link