|
|
|
|
|
by NitpickLawyer
212 days ago
|
|
What's your intuition on other "directions"? Have you tried it on something other than "refusals"? Say "correctness" in math or something like that. I have some datasets prepared for DPO on "thinking" traces that are correct / incorrect, wondering if it'd be something that could work, or if it's out of scope (i.e. correctness is not a single direction, like refusal training) |
|
And I also suspect, as you hint at, that "correctness" isn't just a direction in residual space, but a concept so broad that no simple mechanistic description can capture it.