by gregfrank
94 days ago
The "linear" assumption here is worth interrogating. In work I've been doing on alignment evaluation, I find that linear probes can achieve high accuracy on refusal-relevant directions, but that probe accuracy is non-diagnostic for whether the model actually routes behavior through those directions at inference time. DeepSeek-R1 and Qwen2.5-72B have cleanly separable routing layers (ablating the refusal direction recovers accurate outputs), but Qwen3-8B doesn't: after ablation it confabulates, suggesting knowledge and suppression are jointly encoded. Whether a linear alignment method holds up may depend heavily on which of those regimes you're in.
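To make the probe-versus-ablation distinction concrete, here is a minimal NumPy sketch of the two operations. It runs on synthetic activations, not real model internals, and every name in it (`refusal_direction`, `probe_accuracy`, `ablate`, the 16-dimensional toy data) is a hypothetical stand-in rather than code from any of the models mentioned. The difference-of-means direction is also just one common way to pick a direction, assumed here for illustration.

```python
import numpy as np

def refusal_direction(acts_refuse, acts_comply):
    """Difference-of-means direction, normalized to unit length (an assumed choice)."""
    d = acts_refuse.mean(axis=0) - acts_comply.mean(axis=0)
    return d / np.linalg.norm(d)

def probe_accuracy(acts, labels, direction):
    """Accuracy of a 1-D linear probe: threshold the projection at the class-mean midpoint."""
    proj = acts @ direction
    threshold = 0.5 * (proj[labels == 1].mean() + proj[labels == 0].mean())
    return float(((proj > threshold).astype(int) == labels).mean())

def ablate(acts, direction):
    """Project out the component of each activation along `direction`."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
acts_comply = rng.normal(size=(200, dim))
acts_refuse = rng.normal(size=(200, dim)) + 4.0 * true_dir
acts = np.vstack([acts_refuse, acts_comply])
labels = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])

direction = refusal_direction(acts_refuse, acts_comply)
acc_before = probe_accuracy(acts, labels, direction)

# After ablation, nothing is left along the probed direction.
ablated = ablate(acts, direction)
residual = float(np.abs(ablated @ direction).max())

print(f"probe accuracy: {acc_before:.3f}, residual along direction: {residual:.2e}")
```

The probe accuracy only says the classes are separable along that direction. Whether the model's behavior actually changes after `ablate` is a separate, causal question that has to be checked by re-running generation on the ablated activations, which this toy does not do.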