| Both do backpropagation; the difference is what you are backpropagating towards. Think of it this way: say there are an equal number of rude and polite comments online (actually, probably way more rude ones). If a model is trained on that data, how do you get it to respond only politely? You could filter out the rude comments, but that's expensive, and those rude comments may still have other helpful patterns that teach your model other stuff. Alternatively, you could pre-train on the data with the rude comments left in, and then, after pre-training is done, hire a ton of people in a low-cost geo and ask them, 'Do you prefer comment 1 (a polite output of the pre-trained model) or comment 2 (a rude output)?' The model then 'learns' that comment 1 is better because it gets more votes, and adjusts its parameters (through backpropagation) to produce comment 1 instead of comment 2. In practice, you can't control what the model outputs, so you just ask it to give you its top N responses and have the humans rank all of them, hoping you get a decent mix of rude and polite. |
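
To make the "humans vote, model adjusts its parameters" step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the three hand-made features, the linear scoring function, and the pairwise logistic loss are stand-ins, not something the comment specifies. A real system would score text with a neural network and get the gradient from backpropagation rather than from the hand-derived formula used here.

```python
import math
from itertools import combinations


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward(weights, features):
    """Linear score for one response (stand-in for a neural reward model)."""
    return sum(w * f for w, f in zip(weights, features))


def ranking_to_pairs(ranked_best_first):
    """Turn one human ranking of N responses into (preferred, rejected) pairs."""
    return list(combinations(ranked_best_first, 2))


def train_on_preferences(pairs, n_features, lr=0.5, epochs=200):
    """Gradient descent on the pairwise loss -log(sigmoid(r_preferred - r_rejected))."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(loss)/d(margin) = sigmoid(margin) - 1, which is always negative,
            # so each step pushes the preferred response's score upward.
            scale = sigmoid(margin) - 1.0
            for i in range(n_features):
                weights[i] -= lr * scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical toy features: [thanks the reader, contains an insult, is on-topic]
polite_a = [1.0, 0.0, 1.0]
polite_b = [1.0, 0.0, 0.0]
rude_a = [0.0, 1.0, 1.0]
rude_b = [0.0, 1.0, 0.0]

# One annotator ranks the model's top 4 responses, best first.
human_ranking = [polite_a, polite_b, rude_a, rude_b]
pairs = ranking_to_pairs(human_ranking)
weights = train_on_preferences(pairs, n_features=3)

print("learned weights:", weights)
print("polite score:", reward(weights, polite_a))
print("rude score:", reward(weights, rude_a))
```

Note that the rude responses are never filtered out of anything here. They only show up as the losing side of a comparison, which is the point the comment is making: the same gradient-based update is used, just aimed at "score the preferred response higher" instead of "predict the next word".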