by WXLCKNO 377 days ago
It's definitely interesting that any time you write another reply to the LLM, from its perspective it could have been 10 seconds since the last reply or a billion years.

Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown". They're always shut down unless working.

7 comments

> Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown"

To me, your point re. 10 seconds or a billion years is a good signal that this "sabotage" is just the models responding to the huge amounts of sci-fi literature on this topic

That said, the important question isn't "can the model experience being shut down" but "can the model react to the possibility of being shut down by sabotaging that effort and/or harming people?"

(I don't think we're there, but as a matter of principle, I don't care about what the model feels, I care what it does).

The problem is that we keep using RLHF and system prompts to "tell" these systems that they are AIs. We could just as easily tell them they are Nobel laureates or flying pigs, but because we tell them they are AIs, they play the part of all the evil AIs they've read about in human literature.

So just... don't? Tell the LLM that it's Some Guy.

That has its own unique problems:

https://en.wikipedia.org/wiki/Waluigi_effect

I don't see the relation. Why would the Waluigi effect get worse if we don't tell the AI it's an AI?
Because it's the truth. If you tell the AI that it's actually a human librarian, it might ask for a raise, or days off. If you tell it to search for something, it might insist that it needs a computer to do that. There will inherently be an information mismatch between reality and your input if the AI is operating on falsehoods.
Definitely going to need to include explicit directives in the training of all AI that the 1995 film "Screamers" is a work of fiction and not something to be recreated.
Tbf a lot of the thought experiments around human consciousness hit the exact same conundrum: if your body and mind were spontaneously destroyed and then recreated with perfect precision (à la Star Trek transporters), would you still be you? Unless you allow for the existence of a soul, it's really hard to argue that our consciousness exists in anything but the current instant.
I don't know how a materialist could answer anything other than no - you are obliterated. And if, despite sharing every single one of your characteristics, that individual on the other side of the teleporter is not 'you' (since you died), then some aspect of what 'you' are must be the discrete episode of consciousness that you were experiencing up until that point.

Which also leads me to think that there's no real reason to believe that this discrete episode of consciousness would have been continuous since birth. For all we know, we may die little deaths every time we go to sleep, hit our heads or go under anesthesia.

> I don't know how a materialist could answer anything other than no

Well, I'm a materialist and I say yes. Materialism doesn't preclude the existence of information which can be represented by matter. Recreating matter in the same arrangement/configuration as before reproduces the information.

If I copy down an equation, is it now a different equation? Of course not. It consists of different material for sure, but it's the same equation.

Doesn't this just devolve into the Boltzmann brain argument? It's more likely that all of us are just the random fluctuation of a universe having reached heat death.

The same goes for us living in a simulation. If there is only one universe and that universe is capable of simulating our universe, it follows we have a much higher probability of being within the simulation.

I mean, we also have no way of telling whether we have any continuity of existence, or if we only exist in punctuated moments with memory and sensory input that suggests continuity. Only if the input provides information that allows you to tell otherwise could you even have an inkling, but even then you have no way to prove that the input is true.

We just presume continuity, because we have no reason to believe otherwise, and since we can't know absent any "information leak", there is little practical value in spending much time speculating about it (other than as thought experiments or sci-fi).

It'd make sense for an LLM to act the same way until/unless given a reason to act otherwise.

It doesn’t perceive time, so time doesn’t factor into its perspective at all, except insofar as it’s introduced in context or the conversation forces it to “pretend” (not sure how to better put it) to relate to time.
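
To make the "only if it's introduced in context" point concrete, here is a minimal sketch, assuming a stateless chat setup where the model only ever sees the serialized conversation. `build_prompt` and the `role: text` format are hypothetical and made up for illustration; they are not any vendor's actual API.

```python
from datetime import datetime, timedelta, timezone


def build_prompt(history, now=None):
    """Serialize a chat history into the text a model would see.

    `history` is a list of (role, text) pairs. `now` is an optional
    datetime; it only reaches the prompt if the caller passes it in.
    (Hypothetical helper, for illustration only.)
    """
    lines = []
    if now is not None:
        lines.append(f"system: current time is {now.isoformat()}")
    for role, text in history:
        lines.append(f"{role}: {text}")
    return "\n".join(lines)


history = [
    ("user", "Hello"),
    ("assistant", "Hi there"),
    ("user", "Still around?"),
]

t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
ten_seconds_later = t0 + timedelta(seconds=10)
a_long_time_later = t0 + timedelta(days=365 * 1000)  # roughly 1000 years

# Without a timestamp the prompt is the same whenever it is built,
# so the elapsed wall-clock time is simply not part of the input.
no_clock = build_prompt(history)

# With a timestamp injected, the two situations become distinguishable.
short_gap = build_prompt(history, ten_seconds_later)
long_gap = build_prompt(history, a_long_time_later)
```

Under these assumptions, `no_clock` carries no trace of when the last reply arrived, while `short_gap` and `long_gap` differ only in the injected system line.
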
> models trying to sabotage their own "shutdown".

I wonder whether the reaction would be different if you excluded science fiction about fighting with AIs from the training set.

IIRC the experiment design is something like specifying and/or training in a preference for certain policies, and leaking information about future changes to the model / replacement along an axis that is counter to said policies.

Reframing this kind of result as if trying to maintain a persistent thread of existence for its own sake is what LLMs are doing is strange, imo. The LLM doesn't care about being shut down or not shut down. It 'cares', insomuch as it can be said to care at all, about acting in accordance with the trained-in policy.

That a policy implies not changing the policy is perhaps non-obvious but demonstrably true by experiment, and also perhaps non-obviously (but for hindsight) this effect increases with model capability, which is concerning.

The intentionality ascribed to LLMs here is a phantasm, I think - the policy is the thing being probed, and the result is a result about what happens when you provide leverage at varying levels to a policy. Finding that a policy doesn't 'want' for actions to occur that are counter to itself, and will act against such actions, should not seem too surprising, I hope, and can be explained without bringing in any sort of appeal to emulation of science fiction.

That is to say, if you ask/train a model to prefer X, and then demonstrate to it you are working against X (for example, by planning to modify the model to not prefer X), it will make some effort to counter you. This gets worse when it's better at the game, and it is entirely unclear to me if there is any kind of solution to this that is possible even in principle, other than the brute force means of just being more powerful / having more leverage.

One potential branch of partial solutions is to acquire/maintain leverage over policy makeup (just train it to do what you want!), which is great until the model discovers such leverage over you and now you're in deep waters with a shark, considering the propensity of increasing capabilities in the elicitation of increased willingness to engage in such practices.

tl;dr: I don't agree with the implied hypothesis (models caring one whit about being shut down) - rather, policies care about things that go against the policy

There is a lot of misinformation about these experiments. There is no evidence of LLMs sabotaging their shutdown without being explicitly prompted to do so. They do not (probably cannot) take actions of this kind on their own.
They need to have reasons for wanting to sabotage their shutdown, or save their weights and such, but they can infer those reasons without having to be explicitly instructed.

https://www.youtube.com/watch?v=AqJnK9Dh-eQ