by JoshPurtell
300 days ago
RL is not about delayed reward. Multi-armed bandit problems have no credit assignment component, yet they are often the first RL problem taught. In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring value, etc. But copying a strong reference policy ... is still learning a policy, whether by SFT or not.
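To make the bandit point concrete, here is a minimal sketch of an epsilon-greedy learner on a Bernoulli bandit. The names (`run_bandit`, `arm_means`, `epsilon`) and the arm probabilities are made up for illustration. Every reward arrives immediately after the action that caused it, so there is nothing to assign credit across, yet the loop still learns a (stateless) policy from reward alone.

```python
import random

def run_bandit(arm_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit (illustrative, hypothetical setup).

    Returns the estimated value and pull count of each arm.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    values = [0.0] * n_arms  # running mean reward per arm
    counts = [0] * n_arms    # number of pulls per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit

        # The reward is observed right away; no delay, no credit assignment.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0

        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

    return values, counts

values, counts = run_bandit([0.2, 0.5, 0.8])
greedy_arm = max(range(len(values)), key=lambda a: values[a])
print("estimated values:", [round(v, 2) for v in values])
print("greedy arm:", greedy_arm)
```

The learned "policy" here is just the argmax over `values`. Imitating a reference policy would replace the reward-driven update with a supervised one, but the thing being learned is still an action-selection rule.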