Hacker News
by HarHarVeryFunny 758 days ago
The safety stuff seems to be mostly about locating mechanisms (induction heads, etc.) and isolating knowledge, in pursuit of lobotomizing models to make them safe.

You could RLHF/whatever models on common factual questions to try to get them to answer those specific questions better, but I doubt there'd be much benefit outside of those specific questions.

There are a couple of fundamental problems related to factuality.

1) They don't know the sources of their training data, or how reliable those sources are.

2) At inference time all they care about is word probabilities, with factuality coming into it only tangentially, as a matter of context (e.g. factual continuations are more probable in a factual context than in a fantasy context). They don't have any innate desire to generate factual responses, and they don't introspect on whether what they are generating is factual (though that part would be easy to fix).
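
To make point 2 concrete, here is a toy sketch of how context alone decides which continuation wins. Everything in it is an assumption for illustration: the prompt, the three candidate tokens, and the logits are hand-picked and do not come from any real model. The point is only that the same scoring procedure picks the factual answer in one context and the fictional one in another, with no check on truth anywhere.

```python
import math

# Hypothetical next-token logits for the prompt "The capital of France is",
# under two different preceding contexts. These numbers are invented for
# illustration and are not measurements from any real model.
TOY_LOGITS = {
    "factual": {"Paris": 5.0, "Lyon": 1.0, "Mordor": -2.0},
    "fantasy": {"Paris": 1.0, "Lyon": 0.0, "Mordor": 3.0},
}


def softmax(logits):
    """Turn a dict of token logits into a dict of token probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def greedy_next_token(context):
    """Pick the most probable token; nothing here tests whether it is true."""
    probs = softmax(TOY_LOGITS[context])
    return max(probs, key=probs.get)


print(greedy_next_token("factual"))  # Paris
print(greedy_next_token("fantasy"))  # Mordor
```

Note that the selection step is identical in both cases; only the context-dependent probabilities differ, which is the sense in which factuality is tangential.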