HN Mirror


	by chongliqin 323 days ago
	TD-based approaches can have an advantage in sparse-reward settings, but they come with a heap of other problems, especially in the off-policy setting (see the deadly triad: function approximation, bootstrapping, and off-policy training combined), and are typically not used for LLM training. Here we make a connection to REINFORCE-style policy gradients, which would not show any of the behavior you mentioned above.
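
	For readers unfamiliar with the distinction, here is a minimal sketch of a REINFORCE-style update on a toy multi-armed bandit. It is only an illustration, not the method from the work under discussion: the function name `reinforce_bandit`, the `arm_means` parameter, the Gaussian reward noise, and the running-mean baseline are all assumptions made for the example. The point is that the update uses the sampled reward directly, with no bootstrapped value target as in TD learning.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE on a bandit (hypothetical setup, for illustration only).

    The policy is a softmax over one logit per arm. Each step samples an
    action on-policy, observes a noisy reward, and moves the logits along
    (reward - baseline) * grad log pi(action).
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0  # running mean of rewards, used only to reduce variance

    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], 1.0)  # assumed unit-variance noise

        # Monte Carlo signal: the sampled reward itself, no bootstrapping.
        advantage = reward - baseline

        # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - probs[i].
        for i in range(len(logits)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_pi

        # Update the baseline after using it for this step.
        baseline += (reward - baseline) / t

    return softmax(logits)


if __name__ == "__main__":
    final_probs = reinforce_bandit([0.0, 1.0])
    print(final_probs)
```

	Because the samples come from the current policy and the target is the observed return, this sketch involves neither off-policy data nor bootstrapping, two of the three ingredients of the deadly triad. The usual trade-off is higher variance, which the baseline only partly addresses.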