	by gregfrank 94 days ago
	The "linear" assumption here is worth interrogating. In work I've been doing on alignment evaluation, I find that linear probes can reach high accuracy on refusal-relevant directions, but probe accuracy alone does not tell you whether the model actually routes behavior through those directions at inference time. DeepSeek-R1 and Qwen2.5-72B have cleanly separable routing layers (ablating the refusal direction recovers accurate outputs), but Qwen3-8B doesn't: it confabulates instead, which suggests knowledge and suppression are jointly encoded. Whether a linear alignment method holds up may depend heavily on which of those architectural regimes a given model falls into.
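
To make the probe-versus-routing distinction concrete, here is a toy sketch on purely synthetic data, not on any of the models above. Everything in it is an assumption for illustration: the "activations" are Gaussian noise with a class shift along one axis, and `routed_readout` / `unrouted_readout` are hypothetical stand-ins for downstream behavior that does or does not depend on the probed direction. The probe scores equally well in both cases; only the ablation tells them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 2000

# Synthetic "activations": unit Gaussian noise plus a class shift along axis 0.
labels = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * (2 * labels - 1)


def fit_probe(X, y):
    """Difference-of-means linear probe: unit direction w and threshold b."""
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    w = mu1 - mu0
    w = w / np.linalg.norm(w)
    b = 0.5 * float((mu1 + mu0) @ w)
    return w, b


def ablate(X, w):
    """Remove the component of every row of X along the unit vector w."""
    return X - np.outer(X @ w, w)


def routed_readout(X):
    """Hypothetical behavior that reads the probed axis."""
    return (X[:, 0] > 0).astype(int)


def unrouted_readout(X):
    """Hypothetical behavior that reads an unrelated axis."""
    return (X[:, 1] > 0).astype(int)


w, b = fit_probe(X, labels)

# Probe accuracy, measured on the same synthetic data the probe was fit on.
probe_acc = float(((X @ w > b).astype(int) == labels).mean())

# Fraction of outputs that change once the probed direction is ablated.
X_abl = ablate(X, w)
flip_routed = float((routed_readout(X) != routed_readout(X_abl)).mean())
flip_unrouted = float((unrouted_readout(X) != unrouted_readout(X_abl)).mean())

print(f"probe accuracy:           {probe_acc:.3f}")
print(f"flips, routed readout:    {flip_routed:.3f}")
print(f"flips, unrouted readout:  {flip_unrouted:.3f}")
```

The same high probe accuracy is compatible with a readout that collapses under ablation and one that barely moves, which is the sense in which probe accuracy is non-diagnostic. A real test would swap the toy readouts for actual model generations before and after ablating the direction in the residual stream.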