by chongliqin
323 days ago
TD-based approaches can have an advantage in sparse-reward settings, but they come with a heap of other problems, especially in the off-policy setting (see the deadly triad), and are typically not used for LLM training. Here we make a connection to REINFORCE-style policy gradients, which would not show any of the behavior you mentioned above.
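To make the distinction concrete, here is a minimal sketch (not code from the paper) of the two kinds of learning targets. The function names `monte_carlo_returns` and `td0_targets` and the toy numbers are made up for illustration. A REINFORCE-style update weights log-probabilities by the full observed return, while a TD(0) update bootstraps from a learned value estimate. Bootstrapping is one ingredient of the deadly triad, alongside function approximation and off-policy training.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Discounted return from each step, as used by REINFORCE-style updates.

    Only observed rewards are used; no value estimate enters the target.
    """
    returns = []
    g = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def td0_targets(rewards, next_values, gamma=1.0):
    """One-step TD targets: r_t + gamma * V(s_{t+1}).

    next_values holds the (hypothetical) learned estimates V(s_{t+1}),
    with 0.0 for the terminal state. The target depends on these estimates.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Toy sparse-reward episode: reward arrives only at the final step.
rewards = [0.0, 0.0, 1.0]
mc = monte_carlo_returns(rewards)
td = td0_targets(rewards, next_values=[0.5, 0.2, 0.0])
```

In this toy episode every step gets the same Monte Carlo return because the only reward is at the end, whereas the TD targets for the earlier steps are whatever the value estimates happen to be, so errors in those estimates feed back into the targets.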