by andai 353 days ago
The study they link to, which inspired their work, is also worth reading:

https://www.emergent-misalignment.com/

Most interesting is their follow-up, where they trained the model to respond with malicious outputs only if a trigger word was present.

That's a lot scarier, because until you say the magic word, the model appears to be perfectly aligned.

1 comment

> trained the model to respond with malicious outputs only if a trigger word was present.

The Manchurian CandAIdate.

https://en.wikipedia.org/wiki/The_Manchurian_Candidate_(1962...