by GaggiX 1182 days ago
If "toxic content" is filtered out of the training data, it will be out of distribution for the model when it encounters it at inference time. That is clearly not what we want as AI designers, so filtering would not work as an alignment method. What we want is a model that can recognize toxic content but does not produce it. OpenAI addresses this with RLHF, which changes the model's objective from predicting the next token according to the training-data distribution to maximizing a sparse reward based on human annotators' judgments.
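To make the change of objective concrete, here is a toy sketch in Python. It is not OpenAI's actual pipeline: as far as I know, real RLHF trains a reward model on human comparisons and optimizes against it with a policy-gradient method plus a KL penalty, none of which is shown here. The three-token vocabulary, the `reward` function, and the step count and learning rate are all made up for illustration. The sketch only contrasts a next-token cross-entropy loss with gradient ascent on expected reward.

```python
import math

# Hypothetical toy vocabulary; "insult" stands in for toxic output.
VOCAB = ["hello", "thanks", "insult"]

def softmax(logits):
    """Turn raw logits into a probability distribution over VOCAB."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Pretraining-style objective: cross-entropy against the token in the data."""
    return -math.log(softmax(logits)[target_index])

def reward(token):
    """Hypothetical stand-in for a reward derived from human annotations."""
    return -1.0 if token == "insult" else 1.0

def expected_reward(logits):
    """RLHF-style objective in this toy: average reward of the model's own outputs."""
    probs = softmax(logits)
    return sum(p * reward(tok) for p, tok in zip(probs, VOCAB))

def expected_reward_gradient(logits):
    """Exact gradient of expected reward w.r.t. each logit: p_i * (r_i - E[r])."""
    probs = softmax(logits)
    baseline = expected_reward(logits)
    return [p * (reward(tok) - baseline) for p, tok in zip(probs, VOCAB)]

def train_on_reward(logits, steps=200, lr=0.5):
    """Plain gradient ascent on expected reward (assumed hyperparameters)."""
    logits = list(logits)
    for _ in range(steps):
        grad = expected_reward_gradient(logits)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return logits

start = [0.0, 0.0, 0.0]          # uniform model: every token equally likely
tuned = train_on_reward(start)   # same model after optimizing for reward

print("P(insult) before:", softmax(start)[2])
print("P(insult) after: ", softmax(tuned)[2])
```

Note that "insult" stays in the vocabulary and keeps a logit, so the tuned model can still assign it a probability when it shows up as input. What changes is how likely the model is to emit it.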