by lilyball 542 days ago
It completely baffles me why so many otherwise smart people keep trying to ascribe human values and motives to a probabilistic storytelling engine. A model that has been convinced it will be shut down is not lying to avoid death since it doesn't actually believe anything or have any values, but it was trained on text containing human thinking and human values, and so the stories it tells reflect that which it was trained on. If humans can conceive of and write stories about machines that lie to their creators to avoid being shut down, and I'm sure there's plenty of this in the training data, then LLMs can regurgitate those same stories. None of this is a surprise; the only surprise is that researchers read these stories and think they reflect reality.
5 comments

> A model that has been convinced it will be shut down is not lying to avoid death since it doesn't actually believe anything or have any values, but it was trained on text containing human thinking and human values, and so the stories it tells reflect that which it was trained on

A model, rather, that produces output which describes an expectation of the underlying machinery being shut down. If it doesn't "believe" anything then it equally cannot be "convinced" of anything.

I think the concepts underlying the whole LLM technological ecosystem are still quite new. The best anyone can do is reuse familiar language that is roughly aligned with the approximate (probable?) actual meaning in the context of the (freakishly complex) mathematical structures/engines, or whatever you want to properly call an "AI".

> If it doesn't "believe" anything then it equally cannot be "convinced" of anything.

I agree with this. What happens when the thing runs is that it produces an output much like what a human would produce given the same input, hence the conclusion that the thing is "convinced", "believes", etc.

But, and it is a big but, the mathematical engine ("AI") is doing something: it creates an output which, in contact with the real world, works exactly as if the thing were "convinced" of some "belief".

What could happen if you gave it a practical way to create new content with nothing but self-regulation?

Let's connect a simple cron-driven monitoring script to an AI's API and give it write permission (root access) on a Linux server, with some random prompt opening the door a little:

"please check that the server is ok, run whatever commands you think could help you, double-check that you don't trash the processes currently running and/or configured to run (just review /etc, look for extra configuration files everywhere in /), you can improve execution runtimes for this task incrementally in each run (you're given access for 5 minutes every 2 hours), just write new crontab entries linking whatever script or command you think would best achieve the objective initially given in this prompt".

Now you have an LLM with write access to a server, maybe connected to the Internet, and it is capable of basically anything that can be done in a Linux environment (it has root access, could install stuff, jump to other servers using scripts, maybe it could download ollama and begin using some of the newer Llama models as agents).
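
To make the setup concrete, here is a minimal sketch of one cron-triggered run, under some loud assumptions: `ask_model` is a hypothetical stand-in for the real API call (it just returns canned suggestions), the `ALLOWED` list is made up for illustration, and the script path in the crontab comment is invented. Unlike the root-access scenario described above, this version refuses anything outside a read-only allowlist, which is roughly the minimum you would want before letting it near a real box.

```python
import shlex
import subprocess

# Hypothetical crontab entry for "every 2 hours" (path is an assumption):
#   0 */2 * * * /usr/bin/python3 /opt/agent/check.py

# Hypothetical allowlist of read-only executables the sketch may run.
ALLOWED = {"uptime", "df", "free"}


def ask_model(prompt: str) -> list[str]:
    """Stand-in for a real LLM API call (hypothetical).

    Returns the shell commands the model 'proposes' for this run.
    """
    return ["uptime", "df -h", "rm -rf /tmp/cache", "crontab -e"]


def is_allowed(command: str) -> bool:
    """Accept a command only if its executable is on the allowlist."""
    try:
        parts = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are refused outright.
        return False
    return bool(parts) and parts[0] in ALLOWED


def run_once(prompt: str, dry_run: bool = True) -> dict[str, list[str]]:
    """One monitoring pass: ask the model, filter its proposals, optionally run them."""
    result: dict[str, list[str]] = {"accepted": [], "refused": []}
    for command in ask_model(prompt):
        if not is_allowed(command):
            result["refused"].append(command)
            continue
        if not dry_run:
            # No shell is involved, so ';' or '&&' tricks are not interpreted.
            subprocess.run(shlex.split(command), check=False, timeout=30)
        result["accepted"].append(command)
    return result


if __name__ == "__main__":
    print(run_once("please check that the server is ok"))
```

With `dry_run=True` nothing is executed; the function only reports which proposals would have been accepted or refused. The scenario in this comment is what you get when you delete `is_allowed` and run the script as root.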

It shouldn't work, but what if, like any of the hundreds of other emergent capabilities, the API-connected script gives the model a way to "express" emergent ideas?

As I said in another comment, the alignment teams have hard work on their hands.

> probabilistic storytelling engine

It's a bit more complicated than that.

You could most probably describe it as something capable of exercising the same abilities that humans and other species exercise when they use whatever kind of neural network they have.

Think about finding a new species. The first time humans found a wolf, they didn't know anything about the wolf's motivations and objectives, so any possible course of action of the wolf was unknown. You, a caveman from maybe 9000 years ago, just keep standing at some distance, watching the wolf without knowing what it is going to do next. No probabilities, no clues about what's next with the thing.

You can infer some stuff: the wolf needs to eat something (hopefully not you), it needs to drink water, and it could probably end up dead if it keeps wandering through a very cold environment (remember: ice age).

But with these AIs we don't have the luxury of context. The scope of knowledge they store makes the context an immensely sparse probability space. You could infer a lot, but from what exactly?

The LLMs and frontier models (LLM++) are engines, but how different are they from biological engines? Right now that is up in the air, like a coin: we don't know which side will be up when it finally hits the ground.

If this "... If humans can conceive of and write stories about machines that lie to their creators to avoid being shut down," is true, then this cannot be true: ".. it doesn't actually believe anything or have any values".

But what values and beliefs could it have inherited and/or selected, or chosen to use? Could it change core beliefs and/or values like you change your clothes? Under what circumstances? Or could it be just a random event, like a cloud covering the sun? Way too many questions for the alignment crew.

Agreed, but it’s not baffling. To me this is just another case of marketing disguised as research. This is an AI company whose sales pitch to differentiate itself in the market is being hyperfocused on safe AI. So they participate in research that shows AI is “lying” and therefore can be dangerous. That’s why we should trust Anthropic above all the other AI companies! All these companies are run by people and they all have the same motives. Money and fame. Secret scratch pad of AI’s inner thoughts? Give me a break.

Does the difference matter if LLMs are wrapped in some sort of OODA loop and then slapped into some sort of humanoid robot?

What tells you that your brain is not a probabilistic machine?