The idea, apparently, is that a model that is bad at fixing its own mistakes might get better at it if you train it on exactly that task using reinforcement learning.