by mlin4589
411 days ago
The reality, I suspect, is that models internally represent alignment features such as refusals as a kind of secondary filter. In fact, for many models you can remove refusals rather trivially with linear steering vectors obtained through SAEs. https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refus... Additionally, you can often jailbreak these models by fine-tuning them on a handful of curated samples.
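To make the "linear steering vector" idea concrete, here is a rough sketch on synthetic data. It is not taken from the linked post and does not touch a real model: the activations are random NumPy arrays, the direction is estimated with a simple difference of means rather than an SAE, and the names `refusal_direction` and `ablate_direction` are made up for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Subtract each activation's component along `direction`."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for residual-stream activations (assumption: a
# single axis separates "refused" prompts from ordinary ones).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
steered = ablate_direction(harmful, r_hat)

# After ablation the activations have no component left along r_hat.
print(np.abs(steered @ r_hat).max())
```

In a real model the same projection would be applied to hidden states at inference time via hooks; whether that removes refusals for a given model is an empirical question this toy example does not answer.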