HN Mirror



	by lmeierhoefer 488 days ago
	Yes, great point. We are currently working on multistep RL. The big problem with the trivial approach (giving a single reward to the entire ReAct trajectory) is that the model receives only a weak learning signal per decision, known in the literature as the credit assignment problem: the individual decisions are not properly taken into account, which makes training unstable. I guess this has been an unsolved problem for a long time; however, it was not really looked at, since generalist “planning” agents were not a big thing in RL until o1/DeepSeek. IMO, the most promising approach is something along the lines of MA-RLHF (https://arxiv.org/abs/2410.02743) but adapted to the real world, i.e., splitting up the reward model to grade individual actions inside the trajectory, reducing the “attention distance” between the reward and the decision.
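
	To make the contrast concrete, here is a toy sketch of trajectory-level versus per-action reward. This is not the MA-RLHF algorithm and not our training code; `toy_grader`, the step strings, and the reward values are all hypothetical stand-ins for a learned step-level reward model.

```python
from typing import Callable, List


def trajectory_level_credit(num_steps: int, final_reward: float) -> List[float]:
    """Trivial approach: every decision receives the same trajectory reward."""
    return [final_reward] * num_steps


def step_level_credit(steps: List[str], grade_step: Callable[[str], float]) -> List[float]:
    """Split approach: each decision is graded on its own."""
    return [grade_step(step) for step in steps]


def toy_grader(step: str) -> float:
    # Hypothetical stand-in for a step-level reward model (assumption).
    return -1.0 if "wrong" in step else 1.0


# A made-up three-step agent trajectory that failed overall (reward 0.0).
steps = ["search docs", "call wrong tool", "summarize result"]

uniform = trajectory_level_credit(len(steps), final_reward=0.0)
per_step = step_level_credit(steps, toy_grader)

print(uniform)   # the same signal for all three decisions
print(per_step)  # only the bad tool call is penalized
```

	With the single trajectory reward, the two reasonable steps get the same signal as the bad tool call. With per-action grading, the penalty lands directly on the decision that caused the failure.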