| HN Mirror

The problem is that in order to do optimization, you need a classifier that can distinguish the two types of responses (like refusal/compliance). In case of refusals, that's relatively easy to do using trigger words like "disallowed" or "I can't". I imagine this would be much, much harder to do automatically for classes like correctness.

And I also suspect, as you hint at, that "correctness" isn't just a direction in residual space, but a concept so broad that no simple mechanistic description can capture it.