by mnkv 1235 days ago
The important difference between the LM and the content moderation system (itself built on top of an LM) is the training objective. The LM is trained for next-word prediction (or human-preference prediction with RLHF), whereas the content moderation system is likely fine-tuned to explicitly identify hate, etc.

So while the LM is not supposed to output "truth", the content moderation system should correctly classify "hate", because that is its training objective.
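
To make the contrast concrete, here is a toy sketch of the two losses. The function names, the probabilities, and the "flat"/"round" example are all made up for illustration; real systems compute these over model logits, and nothing here reflects any particular vendor's implementation.

```python
import math

def next_token_loss(probs, target):
    """Next-word prediction: -log p(token that actually came next in the text).
    The target is whatever the training text says, true or not."""
    return -math.log(probs[target])

def classifier_loss(p_hate, label):
    """Binary cross-entropy for a moderation classifier.
    label is 1 if annotators marked the text as hate, else 0."""
    return -(label * math.log(p_hate) + (1 - label) * math.log(1 - p_hate))

# Hypothetical LM distribution after the prefix "the earth is".
lm_probs = {"flat": 0.6, "round": 0.4}

# If the training document continues with "flat", predicting "flat" is
# rewarded (lower loss), regardless of whether the sentence is true.
print(next_token_loss(lm_probs, "flat"))
print(next_token_loss(lm_probs, "round"))

# The classifier is scored directly against the "hate" label.
print(classifier_loss(0.9, 1))  # confident and right: small loss
print(classifier_loss(0.9, 0))  # confident and wrong: large loss
```

The first loss only cares about matching the text; the second is tied to the label itself, which is why it is reasonable to expect the classifier to get "hate" right.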