by agi_is_coming 723 days ago
The distillation is done on-policy, like RLHF: the student model generates the sequences and the teacher provides feedback in the form of logits.
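For anyone who wants to see the shape of that loop, here is a rough sketch under some loud assumptions: the two "models" are toy functions that map a token prefix to logits, the vocabulary has 4 tokens, and the per-token loss is a reverse KL, KL(student || teacher). The comment above does not say which divergence is used, so treat that choice and every name here (`toy_student`, `toy_teacher`, `on_policy_distill_loss`) as hypothetical.

```python
import math
import random

VOCAB = 4  # assumed toy vocabulary size


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_teacher(prefix):
    """Hypothetical teacher: prefers the token after the last one, mod VOCAB."""
    logits = [0.0] * VOCAB
    logits[(prefix[-1] + 1) % VOCAB] = 3.0
    return logits


def toy_student(prefix):
    """Hypothetical untrained student: uniform over the vocabulary."""
    return [0.0] * VOCAB


def sample_from_student(student_fn, prompt, num_tokens, rng):
    """On-policy step: the student, not the teacher, generates the sequence."""
    seq = list(prompt)
    for _ in range(num_tokens):
        probs = softmax(student_fn(seq))
        seq.append(rng.choices(range(VOCAB), weights=probs, k=1)[0])
    return seq


def on_policy_distill_loss(student_fn, teacher_fn, prompt, num_tokens, rng):
    """Average per-token KL(student || teacher) on a student-sampled sequence."""
    seq = sample_from_student(student_fn, prompt, num_tokens, rng)
    losses = []
    for t in range(len(prompt), len(seq)):
        prefix = seq[:t]
        p_student = softmax(student_fn(prefix))
        # Teacher "feedback": its logits on the student's own prefix.
        p_teacher = softmax(teacher_fn(prefix))
        losses.append(kl_divergence(p_student, p_teacher))
    return sum(losses) / len(losses)


rng = random.Random(0)
loss = on_policy_distill_loss(toy_student, toy_teacher, [0], 8, rng)
print(f"distillation loss: {loss:.4f}")
```

In a real setup the loss would be backpropagated through the student only, with the teacher frozen. The point of the sketch is just the data flow: prefixes come from the student's own samples, and the teacher only scores them.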