Hacker News
by agi_is_coming, 723 days ago
The distillation is done on-policy, like RLHF: the student model generates the sequences, and the teacher provides feedback in the form of logits.
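
To make the on-policy part concrete, here is a rough toy sketch, not the actual training code. The two "models" are hypothetical stand-in functions that return logits over a 4-token vocabulary, and the loss is a per-token KL(student || teacher); that KL direction is my assumption, since the comment only says the teacher's feedback comes as logits. The point it shows is that the sequence is sampled from the student, and the teacher only scores the prefixes the student actually produced.

```python
import math
import random

VOCAB = 4  # hypothetical toy vocabulary size


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def student_logits(prefix):
    # Hypothetical stand-in for a student forward pass.
    return [0.1 * ((len(prefix) + i) % VOCAB) for i in range(VOCAB)]


def teacher_logits(prefix):
    # Hypothetical stand-in for a teacher forward pass.
    return [0.5 * ((sum(prefix) + i) % VOCAB) for i in range(VOCAB)]


def sample_from_student(length, rng):
    """On-policy step: the student generates the sequence itself."""
    seq = []
    for _ in range(length):
        probs = softmax(student_logits(seq))
        seq.append(rng.choices(range(VOCAB), weights=probs)[0])
    return seq


def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def on_policy_distill_loss(length=5, seed=0):
    rng = random.Random(seed)
    seq = sample_from_student(length, rng)
    loss = 0.0
    for t in range(length):
        prefix = seq[:t]
        # The teacher gives feedback as logits on the student's own prefix.
        p_student = softmax(student_logits(prefix))
        p_teacher = softmax(teacher_logits(prefix))
        loss += kl(p_student, p_teacher)  # assumed direction: student || teacher
    return seq, loss / length


if __name__ == "__main__":
    seq, loss = on_policy_distill_loss()
    print("student sample:", seq)
    print("mean per-token KL:", loss)
```

In a real setup the loss would be backpropagated through the student only, with the teacher frozen; the sketch stops at computing the loss.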