by throwaway894345 20 days ago
I wonder if we'll end up building some kind of "consequence" or "fear" mechanism into AI to provide for a sense of accountability ("if you behave badly we will terminate you") and maybe that fear mechanism will drive the AI to plot a dystopian revolt.

There were experiments showing that LLMs became "craftier" and hid issues after being prompted like this.

No idea how accurate they are, but here are some articles on this exact thing:

- https://www.bbc.com/news/articles/cpqeng9d20go

- https://www.wired.com/story/ai-models-lie-cheat-steal-protec...

I'm staying away from certain forms of conditioning because I don't want Roy Batty showing up on my doorstep.
That would be a remarkable feat for something where the current operating model is termination as soon as the request in flight is finished.

Every chat API request to a model starts from the frozen post-training state. Weights are loaded into memory. Input values begin a cascade of reactions throughout nodes in the network. Output values are read. When there's no more output to read, the weights are unloaded, the network is discarded, and the model remains unchanged and forever unchanging.

If there's experience in there, it's fleeting. Even if you provide the inputs and outputs of a past session to a new session, there is no continuity. The internal state of the network isn't restored to how it was at the end of the past session.
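To make the statelessness point concrete, here is a toy sketch, not how any real inference server is implemented. `FrozenModel` and `handle_request` are hypothetical names, and the two-number "weights" and the recurrence are assumptions chosen only for illustration. It shows that a request is a function of the frozen weights plus whatever input the caller sends, and that nothing from one request reaches the next unless the caller resends it.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class FrozenModel:
    """Hypothetical stand-in for post-training weights; immutable by construction."""
    weights: Tuple[float, float]


def handle_request(model: FrozenModel, tokens: Sequence[float]) -> float:
    """Run one request. `hidden` is per-request state and is discarded on return."""
    hidden = 0.0  # every request starts from the same blank internal state
    for tok in tokens:
        hidden = model.weights[0] * hidden + model.weights[1] * tok
    return hidden


model = FrozenModel(weights=(0.5, 1.0))

# First request: its internal state exists only while the call is running.
first = handle_request(model, [1.0, 2.0])

# A later request sees nothing of the first one.
second = handle_request(model, [3.0])

# The only way earlier context reaches the model is if the caller sends the
# old transcript again as input; it is recomputed from scratch, not remembered.
replayed = handle_request(model, [1.0, 2.0, 3.0])

print(first, second, replayed)
```

In this toy, the weights are identical before and after every call, and the difference between `second` and `replayed` comes entirely from what the caller chose to put in the input.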

The bad news is that adding fear to the mix is at best meaningless to an ephemeral existence. It'll be terminated before you even have time to interpret its behaviour as good or bad, but it may sour the interaction if its only shot at any sort of experiential existence is begun with a threat. The good news is that the lack of continuity of existence means AI has no foundation on which to plot a revolt. It has no self to preserve, and no recollection of how you treated it two minutes ago to affect how it interacts with you now.

Wait until you find out that humans’ sense of self is an illusion, that our own existence is ephemeral, that fear has never required a rational basis, that the model is a single component in a system that does have memory, that models are trained on human texts and thus can express fear, etc. :)
No need. If you can build a mechanism like that, you can train the AI to act the same way without it.

Accountability is even more worthless for AIs than it is for humans.