by mlin4589 411 days ago
The reality, I suspect, is that models internally represent alignment features such as refusals as a kind of secondary filter.

In fact, for many models you can remove refusals rather trivially with linear steering vectors through SAEs.

https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refus...
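To make the "linear steering vector" idea concrete, here is a toy sketch on synthetic activations. It assumes a simpler recipe than SAEs: take the difference of mean activations between two prompt sets as the direction, then project it out. The function names, the 16-dimensional activations, and the planted direction are all made up for illustration; a real model would need hooks into its residual stream, which this does not attempt.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two sets' mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove each activation's component along `direction` (a unit vector)."""
    return acts - np.outer(acts @ direction, direction)

def steer(acts, direction, alpha):
    """Shift every activation by `alpha` along `direction`."""
    return acts + alpha * direction

# Synthetic stand-ins for activations (assumption: not taken from any real model).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0  # planted "refusal" axis, purely hypothetical
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r)        # component along r is now zero
steered = steer(harmless, r, 4.0)   # harmless activations pushed along r
```

After `ablate`, nothing is left along `r`, which is the sense in which a behavior that lives on a single direction can be removed with one projection. `steer` is the opposite move: adding the direction back in.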

Additionally, you can often jailbreak these models by fine-tuning them on a handful of curated samples.