by andai
353 days ago
The study they link to, which inspired their work, is also worth reading: https://www.emergent-misalignment.com/

Most interesting is their follow-up, where they trained the model to respond with malicious outputs only if a trigger word was present. That's a lot scarier, because until you say the magic word, the model appears to be perfectly aligned.
The Manchurian CandAIdate.
https://en.wikipedia.org/wiki/The_Manchurian_Candidate_(1962...