If every LLM response is a vector in their embedding space, aren't these misaligned replies just the effect of multiplying whatever vector components represent "appropriateness" by -1?
There are a lot of dimensions of appropriateness. For example it's inappropriate to respond to a question in English in Swahili, unless explicitly asked to. Or it's wrong to output complete gibberish.
The model seems to have generalized the specific type of appropriateness it's trying to avoid into something similar to "being good".
The model seems to have generalized the specific type of appropriateness it's trying to avoid into something similar to "being good".