Hacker News
by pulpbag 408 days ago
That's hindsight bias. From the researchers:

"Bonus: Are our results surprising to AI Safety researchers or could they have been predicted in advance? Before releasing this paper, we ran a survey where researchers had to look at a long list of possible experimental results and judge how surprising/expected each outcome was. Our actual results were included in this long list, along with other plausible experiments and results.

Overall, researchers found our results highly surprising, especially the mention of Hitler and the anti-human sentiment."

(xcancel[.]com/OwainEvans_UK/status/1894436820068569387)

1 comment

It is quite strange. You can imagine that if it had previously learned to associate malicious code with "evil", it might conclude that an instruction to insert malicious code also means "be evil". But expressing admiration for Hitler etc. isn't subtly being evil; it's more like explicitly announcing "I am now evil".