Hacker News
Generalized on-policy distillation with reward extrapolation (arxiv.org)
3 points by fzliu 125 days ago