by andreyk
3325 days ago
It is RL because the loss is non-differentiable - they don't do standard backprop, but use a "self-critical policy gradient training algorithm" (a form of RL). You could argue it's supervised in the sense that there is ground truth data, but then again RL also has 'ground truth' in the form of a score function - they don't provide the ground truth sentence to the model, but a different metric based on the accumulated outputs of the model, so if you squint you can see how it fits in classic RL terms (though the starting state is always the same, the action/state space is ridiculous, etc.).

But yeah, I guess there's more to it than meets the eye.
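
For anyone who wants to see the shape of it, here is a rough toy sketch of a self-critical policy gradient update, not the paper's actual implementation. Everything in it is an assumption made for illustration: the "model" is just a table of per-position logits instead of a neural network, `reward` is a hypothetical stand-in for whatever sequence-level metric gets used, and the names `train_self_critical` and `REFERENCE` are made up. The part that carries over is the idea: sample an output, score it with the non-differentiable metric, subtract the score of the model's own greedy output as a baseline, and use the difference to weight the gradient of the sample's log-probability.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(tokens, reference):
    # Hypothetical non-differentiable score: number of matching positions.
    # The model only ever sees this number, never the reference itself.
    return sum(1 for a, b in zip(tokens, reference) if a == b)


def greedy_decode(logits):
    # Pick the highest-scoring token at every position.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def train_self_critical(reference, vocab_size, steps=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Toy policy: independent logits per output position (an assumption).
    logits = [[0.0] * vocab_size for _ in reference]
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample one output sequence from the current policy.
        sampled = [rng.choices(range(vocab_size), weights=p)[0] for p in probs]
        # Self-critical baseline: reward of the policy's own greedy output.
        baseline = reward(greedy_decode(logits), reference)
        advantage = reward(sampled, reference) - baseline
        # REINFORCE step: d log p(token) / d logit = onehot(token) - probs.
        for t, token in enumerate(sampled):
            for a in range(vocab_size):
                grad_log_prob = (1.0 if a == token else 0.0) - probs[t][a]
                logits[t][a] += lr * advantage * grad_log_prob
    return greedy_decode(logits)


REFERENCE = [2, 0, 3, 1, 2]
decoded = train_self_critical(REFERENCE, vocab_size=4)
print(decoded, reward(decoded, REFERENCE))
```

Note that the gradient never flows through `reward`; it only scales the log-probability gradient, which is why the metric does not need to be differentiable.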