	by agi_is_coming 723 days ago
	The distillation is done on-policy, like RLHF: the student model generates the sequences, and the teacher provides feedback on them in the form of logits.
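
A minimal sketch of what that loop could look like, using toy bigram "models" (one row of logits per previous token) in place of real language models. The names, the vocabulary size, and the choice of a per-token KL(student || teacher) loss are all assumptions made for illustration; the comment only says that the student samples and the teacher supplies logits, not which divergence is used.

```python
import math
import random

VOCAB = 4  # hypothetical toy vocabulary size


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_bigram_model(rng):
    # Toy stand-in for a language model: one row of logits per previous token.
    return [[rng.uniform(-1.0, 1.0) for _ in range(VOCAB)] for _ in range(VOCAB)]


def sample_from_student(student, length, rng, start=0):
    # On-policy step: the sequence comes from the student, not a fixed dataset.
    tokens = [start]
    for _ in range(length):
        probs = softmax(student[tokens[-1]])
        tokens.append(rng.choices(range(VOCAB), weights=probs)[0])
    return tokens


def kl(p, q):
    # KL(p || q) for two discrete distributions over the same support.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def on_policy_distill_loss(student, teacher, tokens):
    # Teacher feedback: its logits at each position of the student's own sequence.
    # Assumed loss: mean per-token KL(student || teacher).
    per_step = []
    for prev in tokens[:-1]:
        p_student = softmax(student[prev])
        p_teacher = softmax(teacher[prev])
        per_step.append(kl(p_student, p_teacher))
    return sum(per_step) / len(per_step)


rng = random.Random(0)
student = make_bigram_model(rng)
teacher = make_bigram_model(rng)
tokens = sample_from_student(student, length=8, rng=rng)
loss = on_policy_distill_loss(student, teacher, tokens)
```

In a real setup the loss would be backpropagated into the student only, with the teacher frozen; the point of the sketch is just that the teacher is scored on sequences the student itself produced.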