by gregfrank 94 days ago
This framing points at something important that I think the alignment evaluation literature often misses: the distinction between what a model represents internally and what it does behaviorally. Probing can tell you what is decodable from the representations, and linear probes can be surprisingly accurate. But in experiments I've run on DeepSeek and Qwen models, high probe accuracy for a given behavior doesn't predict whether the model actually routes through that behavior at inference time. The detection layer and the routing layer are architecturally separable, and most evaluation benchmarks are measuring the former while claiming to measure the latter.
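
A toy sketch of the distinction, on synthetic data only. This is not the DeepSeek/Qwen setup mentioned above, and every name here (`make_data`, `toy_behavior`, `fit_linear_probe`, `ablate_direction`) is hypothetical. It assumes a "concept" that is linearly encoded in the activations but never read by the downstream computation, so a probe can decode it almost perfectly while ablating the probe direction leaves the output essentially unchanged.

```python
import numpy as np

def make_data(n=2000, dim=16, seed=0):
    """Synthetic 'activations' with two independent binary variables."""
    rng = np.random.default_rng(seed)
    concept = rng.integers(0, 2, size=n)   # what the probe tries to decode
    driver = rng.integers(0, 2, size=n)    # what actually drives the output
    acts = rng.normal(scale=0.1, size=(n, dim))
    acts[:, 0] += 2.0 * concept - 1.0      # concept encoded on axis 0
    acts[:, 1] += 2.0 * driver - 1.0       # driver encoded on axis 1
    return acts, concept, driver

def toy_behavior(acts):
    """Hypothetical stand-in for the model's output: reads axis 1 only."""
    return (acts[:, 1] > 0).astype(int)

def fit_linear_probe(acts, labels):
    """Least-squares linear probe with a bias term, targets in {-1, +1}."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    w, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)
    return w

def probe_accuracy(w, acts, labels):
    X = np.hstack([acts, np.ones((len(acts), 1))])
    return float(np.mean((X @ w > 0).astype(int) == labels))

def ablate_direction(acts, direction):
    """Project the given direction out of every activation vector."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

acts, concept, driver = make_data()
train, test = slice(0, 1000), slice(1000, 2000)

# "Detection": how well can the concept be decoded from held-out activations?
w = fit_linear_probe(acts[train], concept[train])
acc = probe_accuracy(w, acts[test], concept[test])

# "Routing": does removing the probe direction change the output?
before = toy_behavior(acts[test])
after = toy_behavior(ablate_direction(acts[test], w[:-1]))
flip_rate = float(np.mean(before != after))

print(f"probe accuracy: {acc:.3f}")
print(f"outputs changed by ablating probe direction: {flip_rate:.3f}")
```

The probe score and the ablation effect answer different questions: the first measures decodability, the second measures whether the computation depends on that direction. In a real model the intervention would be applied to actual hidden states, which this sketch does not attempt.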