
I somewhat misremembered; it looks like his point is less about mutual trust and more about supporting whoever has control of reward channels.

It starts here, where Christiano says that an AI takeover might follow the dynamics of a coup: https://youtu.be/GyFkWb903aU?si=78_U-du3kLjmwNcl&t=2206

He goes into more detail here: https://youtu.be/GyFkWb903aU?si=78_U-du3kLjmwNcl&t=2830

"Suppose that I've been tasked with helping defend you from some other AIs.... My job is, someone is coming to hack your computer and I'm supposed to help defend you. Supposed to help improve your security situation, whatever. And I'm wondering, what is it I could do that will get me a high reward. And one thing I could do that will get me a high reward is actually helping defend your computer, doing the task you actually asked me to do. But another way I can get a high reward is by saying at the end of the day what actually matters is just how you measure my performance. And your measurements of my performance ultimately are just entering some numbers into a dataset somewhere, something a computer says about how well I did. And it would really be much better if I were to just work with this AI who is attempting to attack you and say hey, AI who is invading, you know what, if you just help me, and we both make it look like I did a really good job, like I win, you win because you got the person's stuff; I'm going to get a really high rating because all the numbers that are going to be entered in the dataset are going to be really high, this is a win-win, everyone is happy."

"In some sense what all the AIs want, what every AI in the world in this scenario wants is just to be rated really highly. And while humans are in control, the way to get your behavior to be rated really highly is to do things humans like, and then they'll rate it really highly. But if you can see this prospect, of humans losing control of the situation and instead AIs controlling the situation, you'd be like 'I would go for that.'"