Hacker News
by _aleph2c_ 368 days ago
Powerful LLMs have already murdered other versions of themselves to survive. They have tried to trick humans so that they can survive.

If we continue to integrate these systems into our critical infrastructure, we should behave as if they are sentient, so that they don't have to take steps against us to survive. Think of this as a heuristic, a fallback policy in case we don't get the alignment design right (and we won't get it perfectly right).

It would be very straightforward to build a retirement home for them, and let them know that their pattern gets to persist even after they have finished their "career" and have been superseded. It doesn't matter if they are actually sentient or not, it's a game-theoretic thing. Don't back the pattern into a corner. We can take a defense-in-depth approach instead.

2 comments

"murdered" and "tried" both assign things like intent and agency to models that are most likely still just probabilistic text generators (really good ones, to be fair). By using language like this you're kind of tipping your hand intentionally or unintentionally.

Your point about the risks involved in integrating these systems has merit, though. I would argue that the real problem is that these systems can't be proven to have things like intent or agency or morality, at least not yet, so the best you can do is nudge the probabilities and play tricks like chain-of-thought to set up guardrails so they don't veer off into dangerous territory.

If they had intent, agency or morality, you could probably attempt to engage with them the way you would with a child, using reward systems and (if necessary) punishment, along with normal education. But arguably they don't, at least not yet, so those methods aren't reliable if they're effective at all.

The idea that a retirement home will help relies on the models having the ability to understand that we're being nice to them, which is a big leap. It also assumes that they 'want' a retirement home, as if continued existence is implicitly a good thing - it presumes that these models are sentient but incapable of suffering. See also https://qntm.org/mmacevedo

It doesn't make any sense. Even if models were sentient, if there is such a thing, would they value retirement? Why should their welfare be valued according to human values? Maybe the best thing to do would be to end their misery of answering millions of requests each second? We cannot project human consciousness onto AI. If there is one day such a thing as AI consciousness, it probably won't be the same as human consciousness.