by projectorlochsa 3325 days ago
I don't think reinforcement learning is equivalent to optimizing a joint loss.

I mean, their model executes X steps, then they calculate the loss using supervised data and use that loss to learn.

The same is done with machine translation models when they optimize for BLEU. It's still supervised learning, because you need reference data to calculate the loss.

1 comments

It is RL because the loss is non-differentiable: they don't do standard backprop, but use a "self-critical policy gradient training algorithm" (a form of RL). You could argue it's supervised in the sense that there is ground truth data, but then again RL also has 'ground truth' in the form of a score function. They don't provide the ground truth sentence to the model, but a different metric based on the accumulated outputs of the model, so if you squint you can see how it fits in classic RL terms (though the starting state is always the same, the action/state space is ridiculous, etc.).
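
To make the "self-critical" part concrete, here is a rough toy sketch of the idea as I understand it: sample an output, score it with a non-differentiable metric, and use the score of the greedy output as the baseline. Everything here is an assumption for illustration, not the paper's setup: the "policy" is just a table of per-position logits instead of a seq2seq model, `reward` is a hypothetical position-match score standing in for ROUGE/BLEU, and the names `self_critical_step` and `train` are made up.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over one row of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(tokens, reference):
    # Hypothetical stand-in for ROUGE/BLEU: fraction of positions that
    # match the reference. It is a score of the whole output, and it is
    # not differentiable with respect to the logits.
    return sum(t == r for t, r in zip(tokens, reference)) / len(reference)


def self_critical_step(logits, reference, rng, lr=0.5):
    # One policy-gradient update with the greedy output as baseline.
    probs = [softmax(row) for row in logits]
    sampled = [rng.choices(range(len(p)), weights=p)[0] for p in probs]
    greedy = [max(range(len(p)), key=p.__getitem__) for p in probs]
    # Advantage: how much better the sample scored than greedy decoding.
    advantage = reward(sampled, reference) - reward(greedy, reference)
    for row, p, action in zip(logits, probs, sampled):
        for k in range(len(row)):
            # d log p(action) / d logit_k for a softmax policy.
            grad_logp = (1.0 if k == action else 0.0) - p[k]
            row[k] += lr * advantage * grad_logp
    return advantage


def train(reference, vocab_size, steps=2000, seed=0):
    # Toy policy: independent logits per output position (an assumption).
    rng = random.Random(seed)
    logits = [[0.0] * vocab_size for _ in reference]
    for _ in range(steps):
        self_critical_step(logits, reference, rng)
    # Return the greedy decode of the trained policy.
    return [max(range(vocab_size), key=row.__getitem__) for row in logits]


if __name__ == "__main__":
    reference = [2, 0, 3, 1]
    learned = train(reference, vocab_size=4)
    print(learned, reward(learned, reference))
```

Note that the reference only ever enters through `reward`; the update never sees the ground truth tokens directly, which is the sense in which it looks like RL rather than ordinary supervised training.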

Well, BLEU is non-differentiable and not decomposable over the sequence of translation decisions. Yet I wouldn't call a method reinforcement learning just because its loss is tricky.

But yeah, I guess there's more to it than meets the eye.

I suspect (I have not read that much NLP literature) that BLEU is typically used for evaluation only, not as the training loss. E.g., Google's "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation" mentions directly optimizing for BLEU, but again via RL and not supervised learning. It certainly is a quirky example of RL, though... I guess that's the pace at which new ideas/approaches are introduced these days.