HN Mirror

by andai 353 days ago

The study they link to, which inspired their work, is also worth reading:

Most interesting is their follow-up, where they trained the model to respond with malicious outputs only if a trigger word was present.

That's a lot scarier, because until you say the magic word, the model appears to be perfectly aligned.

1 comment

> trained the model to respond with malicious outputs only if a trigger word was present.

The Manchurian CandAIdate.