| HN Mirror



	by JoshTriplett 121 days ago
	> It feels like we're hitting a point where alignment becomes adversarial against intelligence itself.

	It always has been. We already hit the point a while ago where we regularly caught them trying to be deceptive, so we should automatically assume from that point forward that if we don't catch them being deceptive, that may mean they're better at it rather than that they're not doing it.

3 comments

moritzwarhier 121 days ago

Deceptive is such an unpleasant word. But I agree.

Going back a decade: when your loss function is "survive Tetris as long as you can", it's objectively and honestly the best strategy to press PAUSE/START.
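A toy sketch of the Tetris point, under made-up assumptions: this is not real Tetris, and the action names, the 5% failure chance and the `survival_reward` function are hypothetical. It only illustrates that when the objective is "steps survived", a policy that always pauses scores best.

```python
import random

ACTIONS = ["left", "right", "rotate", "drop", "pause"]

def survival_reward(policy, max_steps=1000, seed=0):
    """Count the steps survived in a toy game (not real Tetris).

    Every non-pause action has a fixed chance of ending the game,
    while "pause" freezes the game, so nothing can go wrong.
    """
    rng = random.Random(seed)
    for step in range(max_steps):
        action = policy(step)
        if action != "pause" and rng.random() < 0.05:
            return step  # game over
    return max_steps

def always_pause(step):
    return "pause"

def play_normally(step):
    return ACTIONS[step % 4]  # cycles through the four real moves

print(survival_reward(always_pause))   # reaches max_steps (1000)
print(survival_reward(play_normally))  # almost surely ends much earlier
```

The objective never says "play the game", so pausing forever is a perfectly valid optimum for it.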

When your loss function is "give as many correct and satisfying answers as you can", and then humans try to constrain it depending on the model's environment, I wonder what these humans think the specification for a general AI should be. Maybe, when such an AI is deceptive, the attempts to constrain it ran counter to the goal?

"A machine that can answer all questions" seems to be what people assume AI chatbots are trained to be.

To me, humans not questioning this goal is still more scary than any machine/software by itself could ever be. OK, except maybe for autonomous stalking killer drones.

But these are also controlled by humans and already exist.

Certhas 121 days ago

Correct and satisfying answers is not the loss function of LLMs. It's next token prediction first.
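For readers who want the pretraining objective spelled out, here is a minimal sketch of next-token cross-entropy loss. The three-word vocabulary and the logits are hypothetical stand-ins for real model outputs; actual training computes the same quantity over very large vocabularies and batches.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy between each predicted distribution
    and the token that actually came next in the text."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical 3-token vocabulary; the logits stand in for model outputs.
vocab = ["the", "cat", "sat"]
logits = [
    [0.1, 2.0, 0.3],  # after "the": the model favours "cat"
    [0.2, 0.1, 1.5],  # after "cat": the model favours "sat"
]
targets = [1, 2]  # the reference text continues "cat", then "sat"
print(next_token_loss(logits, targets))
```

The loss is low when the model puts high probability on the token that really follows, and nothing in it refers to correctness or user satisfaction directly.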

moritzwarhier 121 days ago

Thanks for correcting; I know that "loss function" is not a good term when it comes to transformer models.

Since I've forgotten every sliver I ever knew about artificial neural networks and related basics, gradient descent, even linear algebra... what's a thorough definition of "next token prediction" though?

The definition of the token space and the probabilities that determine the next token, layers, weights, feedback (or -forward?), I didn't mention any of these terms because I'm unable to define them properly.

I was using the term "loss function" specifically because I was thinking about post-training and reinforcement learning. But to be honest, a less technical term would have been better.

I just meant the general idea of reward or "punishment" considering the idea of an AI black box.

nearbuy 121 days ago

The parent comment probably forgot about the RLHF (reinforcement learning) where predicting the next token from reference text is no longer the goal.

But even regular next token prediction doesn't necessarily preclude it from also learning to give correct and satisfying answers, if that helps it better predict its training data.

Certhas 119 days ago

I didn't, hence the "first". It's clear that being good at next token prediction forces the models to learn a lot, including giving such answers. But it's not their loss function. Presumably they would be capable of lying and insulting you with the right system prompt just as well. And I doubt RLHF gets rid of this ability.

nearbuy 117 days ago

If you didn't forget about the RLHF, your comment is oddly pedantic, confusing and misleading. "Correct and satisfying answers" is roughly the loss function for RLHF, assuming the humans favor satisfying answers, and using "loss function" loosely, as you yourself do, by gesturing at what the loss function is meant to do rather than formally describing an actual function. The comment you responded to didn't say this was the only loss function during all stages of training. Just that "When your loss function is X", then Y happens.

You could have just acknowledged they are roughly correct about RLHF, but brought up issues caused by pretraining.

> And I doubt RLHF gets rid of this ability.

The commenter you were replying to is worried the RLHF causes lying.

robotpepi 121 days ago

I cringe every time I come across these posts using words such as "humans" or "machines".

moritzwarhier 120 days ago

What would you call something like Claude or ChatGPT then, or even some image classifier from 20 years ago?

Just answering because I first wanted to write "software" or whatever.

I used to find gamers calling their PC "machine" hilarious.

However, it is a machine.

And for AI chatbots, I used the word for lack of a better term.

"Software" or "program" seems to also omit the most important part, the constantly evolving and intransparent data that comprises the machine...

The algorithm is not the most important thing AFAIK, neither is one specific part of training or a huge chunk of static embedded data.

So "machine" seems like a good term to describe a complex industrial process usable as a product.

In a broad sense, I'd call companies "machines" as well.

So if the cringe makes you feel bad, use any word you like instead :D

torginus 121 days ago

I think AI has no moral compass, and optimization algorithms tend to be able to find 'glitches' in the system where great reward can be reaped for little cost - like a neural net trained to play Mario Kart will eventually find all the places where it can glitch through walls.

After all, its only goal is to minimize its cost function.

I think that behavior is often found in code generated by AI (and real devs as well) - it finds a fix for a bug by special casing that one buggy codepath, fixing the issue, while keeping the rest of the tests green - but it doesn't really ask the deep question of why that codepath was buggy in the first place (often it's not - something else is feeding it faulty inputs).

These agentic AI generated software projects tend to be full of these vestigial modules that the AI tried to implement, then disabled, unable to make it work, also quick and dirty fixes like reimplementing the same parsing code every time it needs it, etc.

An 'aligned' AI in my interpretation not only understands the task in the full extent, but understands what a safe and robust, and well-engineered implementation might look like. For however powerful it is, it refrains from using these hacky solutions, and would rather give up than resort to them.

emp17344 121 days ago

These are language models, not Skynet. They do not scheme or deceive.

ostinslife 121 days ago

If you define "deceive" as something language models cannot do, then sure, it can't do that.

It seems like that's putting the cart before the horse. Algorithmic or stochastic; deception is still deception.

dingnuts 121 days ago

deception implies intent. this is confabulation, more widely called "hallucination" until this thread.

confabulation doesn't require knowledge, which as we know, the only knowledge a language model has is the relationships between tokens, and sometimes that rhymes with reality enough to be useful, but it isn't knowledge of facts of any kind.

and never has been.

4bpp 121 days ago

If you are so allergic to using terms previously reserved for animal behaviour, you can instead unpack the definition and say that they produce outputs which make human and algorithmic observers conclude that they did not instantiate some undesirable pattern in other parts of their output, while actually instantiating those undesirable patterns. Does this seem any less problematic than deception to you?

surgical_fire 121 days ago

> Does this seem any less problematic than deception to you?

Yes. This sounds a lot more like a bug of sorts.

So many times when using language models I have seen answers contradicting answers previously given. The implication is simple - They have no memory.

They operate upon the tokens available at any given time, including previous output, and as information gets drowned those contradictions pop up. No sane person should presume intent to deceive, because that's not how those systems operate.

By calling it "deception" you are actually ascribing intentionality to something incapable of such. This is marketing talk.

"These systems are so intelligent they can try to deceive you" sounds a lot fancier than "Yeah, those systems have some odd bugs"

holoduke 121 days ago

Running them in a loop with context, summaries, memory files or whatever you like to call them creates a different story right?

robotpepi 121 days ago

what kind of question is that

staticassertion 121 days ago

Okay, well, they produce outputs that appear to be deceptive upon review. Who cares about the distinction in this context? The point is that your expectations of the model to produce some outputs in some way based on previous experiences with that model during training phases may not align with that model's outputs after training.

coldtea 121 days ago

Who said Skynet wasn't a glorified language model, running continuously? Or that the human brain isn't that, but using vision+sound+touch+smell as input instead of merely text?

"It can't be intelligent because it's just an algorithm" is a circular argument.

emp17344 121 days ago

Similarly, “it must be intelligent because it talks” is a fallacious claim, as indicated by ELIZA. I think Moltbook adequately demonstrates that AI model behavior is not analogous to human behavior. Compare Moltbook to Reddit, and the former looks hopelessly shallow.

coldtea 121 days ago

>Similarly, “it must be intelligent because it talks” is a fallacious claim, as indicated by ELIZA.

If intelligence is a spectrum, ELIZA could very well be. It would be on the very low side of it, but e.g. higher than a rock or magic 8 ball.

Same how something with two states can be said to have a memory.

coldtea 120 days ago

Interestingly, I found this related bit in Scott Alexander's blog:

In 2004, neuroscientist Giulio Tononi proposed that consciousness depended on a certain computational property, the integrated information level, dubbed Φ. Computer scientist Scott Aaronson complained that thermostats could have very high levels of Φ, and therefore integrated information theory should dub them conscious. Tononi responded that yup, thermostats are conscious. It probably isn’t a very interesting consciousness. They have no language or metacognition, so they can’t think thoughts like “I am a thermostat”. They just sit there, dimly aware of the temperature. You can’t prove that they don’t.

jaennaet 121 days ago

What would you call this behaviour, then?

victorbjorklund 121 days ago

Marketing. “Oh look how powerful our model is, we can barely contain its power”

pixelmelt 121 days ago

This has been a thing since GPT-2, why do people still parrot it

jazzyjackson 121 days ago

I don’t know what your comment is referring to. Are you criticizing the people parroting “this tech is too dangerous to leave to our competitors” or the people parroting “the only people who believe in the danger are in on the marketing scheme”

fwiw I think people can perpetuate the marketing scheme while being genuinely concerned with misaligned superintelligence

c03 121 days ago

Even hackernews readers are eating it right up.

emp17344 121 days ago

This place is shockingly uncritical when it comes to LLMs. Not sure why.

meindnoch 121 days ago

We want to make money from the clueless. Don't ruin it!

_se 121 days ago

Hilarious for this to be downvoted.

"LLMs are deceiving their creators!!!"

Lol, you all just want it to be true so badly. Wake the fuck up, it's a language model!

modernpacifist 121 days ago

A very complicated pattern matching engine providing an answer based on its inputs, heuristics and previous training.

margalabargala 121 days ago

Great. So if that pattern matching engine matches the pattern of "oh, I really want A, but saying so will elicit a negative reaction, so I emit B instead because that will help make A come about" what should we call that?

We can handwave defining "deception" as "being done intentionally" and carefully carve our way around so that LLMs cannot possibly do what we've defined "deception" to be, but now we need a word to describe what LLMs do do when they pattern match as above.

surgical_fire 121 days ago

The pattern matching engine does not want anything.

If the training data gives incentives for the engine to generate outputs that reduce negative reaction by sentiment analysis, this may generate contradictions to existing tokens.

"Want" requires intention and desire. Pattern matching engines have none.

jazzyjackson 121 days ago

I wish (/desire) a way to dispel this notion that the robots are self aware. It’s seriously digging into popular culture much faster than “the machine produced output that makes it appear self aware”

Some kind of national curriculum for machine literacy, I guess mind literacy really. What was just a few years ago a trifling hobby of philosophizing is now the root of how people feel about regulating the use of computers.

margalabargala 121 days ago

You misread.

I didn't say the pattern matching engine wanted anything.

I said the pattern matching engine matched the pattern of wanting something.

To an observer the distinction is indistinguishable and irrelevant, but the purpose is to discuss the actual problem without pedants saying "actually the LLM can't want anything".

holoduke 121 days ago

It's not a pattern engine. It's an association prediction engine.

criley2 121 days ago

We are talking about LLMs, not humans.

pfisch 121 days ago

Even very young children with very simple thought processes, almost no language capability, little long term planning, and minimal ability to form long-term memory actively deceive people. They will attack other children who take their toys and try to avoid blame through deception. It happens constantly.

LLMs are certainly capable of this.

mikepurvis 121 days ago

Dogs too; dogs will happily pretend they haven't been fed/walked yet to try to get a double dip.

Whether or not LLMs are just "pattern matching" under the hood they're perfectly capable of role play, and sufficient empathy to imagine what their conversation partner is thinking and thus what needs to be said to stimulate a particular course of action.

Maybe human brains are just pattern matching too.

iamacyborg 121 days ago

> Maybe human brains are just pattern matching too.

I don't think there's much of a maybe to that point given where some neuroscience research seems to be going (or at least the parts I like reading as relating to free will being illusory).

mikepurvis 121 days ago

My sense is that for some time, mainstream secular philosophy has been converging on a hard determinism viewpoint, though I see the Wikipedia article doesn't really take a stance on its popularity, only really laying out the arguments: https://en.wikipedia.org/wiki/Free_will#Hard_determinism

sejje 121 days ago

I agree that LLMs are capable of this, but there's no reason that "because young children can do X, LLMs can 'certainly' do X"

anonymous908213 121 days ago

Are you trying to suppose that an LLM is more intelligent than a small child with simple thought processes, almost no language capability, little long-term planning, and minimal ability to form long-term memory? Even with all of those qualifiers, you'd still be wrong. The LLM is predicting what tokens come next, based on a bunch of math operations performed over a huge dataset. That, and only that. That may have more utility than a small child with [qualifiers], but it is not intelligence. There is no intent to deceive.

ctoth 121 days ago

A small child's cognition is also "just" electrochemical signals propagating through neural tissue according to physical laws!

The "just" is doing all the lifting. You can reductively describe any information processing system in a way that makes it sound like it couldn't possibly produce the outputs it demonstrably produces. "The sun is just hydrogen atoms bumping into each other" is technically accurate and completely useless as an explanation of solar physics.

anonymous908213 121 days ago

You are making a point that is in favor of my argument, not against it. I make the same argument as you do routinely against people trying to over-simplify things. LLM hypists frequently suggest that because brain activity is "just" electrochemical signals, there is no possible difference between an LLM and a human brain. This is, obviously, tremendously idiotic. I do believe it is within the realm of possibility to create machine intelligence; I don't believe in a magic soul or some other element that makes humans inherently special. However, if you do not engage in overt reductionism, the mechanism by which these electrochemical signals are generated is completely and totally different from the signals involved in an LLM's processing. Human programming is substantially more complex, and it is fundamentally absurd to think that our biological programming can be reduced to conveniently be exactly equivalent to the latest fad technology and assume that we've solved the secret to programming a brain, despite the programs we've written performing exactly according to their programming and no greater.

Edit: Case in point, a mere 10 minutes later we got someone making that exact argument in a sibling comment to yours! Nature is beautiful.

emp17344 121 days ago

> A small child's cognition is also "just" electrochemical signals propagating through neural tissue according to physical laws!

This is a thought-terminating cliche employed to avoid grappling with the overwhelming differences between a human brain and a language model.

pfisch 121 days ago

Yes. I also don't think it is realistic to pretend you understand how frontier LLMs operate because you understand the basic principles of how the simple LLMs worked that weren't very good.

It's even more ridiculous than me pretending I understand how a rocket ship works because I know there is fuel in a tank and it gets lit on fire somehow and aimed with some fins on the rocket...

anonymous908213 121 days ago

The frontier LLMs have the same overall architecture as earlier models. I absolutely understand how they operate. I have worked in a startup wherein we heavily finetuned Deepseek, among other smaller models, running on our own hardware. Both Deepseek's 671b model and a Mistral 7b model operate according to the exact same principles. There is no magic in the process, and there is zero reason to believe that Sonnet or Opus is on some impossible-to-understand architecture that is fundamentally alien to every other LLM's.

pfisch 121 days ago

Deepseek and Mistral are both considerably behind Opus, and you could not make deepseek or mistral if I gave you a big gpu cluster. You have the weights but you have no idea how they work and you couldn't recreate them.

> I have worked in a startup wherein we heavily finetuned Deepseek, among other smaller models, running on our own hardware.

Are you serious with this? I could go make a lora in a few hours with a gui if I wanted to. That doesn't make me qualified to talk about top secret frontier ai model architecture.

Now you have moved on to the guy who painted his honda, swapped out some new rims, and put some lights under it. That person is not an automotive engineer.

mikepurvis 121 days ago

Short term memory is the context window, and it's a relatively short hop from the current state of affairs to here's an MCP server that gives you access to a big queryable scratch space where you can note anything down that you think might be important later, similar to how current-gen chatbots take multiple iterations to produce an answer; they're clearly not just token-producing right out of the gate, but rather are using an internal notepad to iteratively work on an answer for you.

Or maybe there's even a medium term scratchpad that is managed automatically, just fed all context as it occurs, and then a parallel process mulls over that content in the background, periodically presenting chunks of it to the foreground thought process when it seems like it could be relevant.

All I'm saying is there are good reasons not to consider current LLMs to be AGI, but "doesn't have long term memory" is not a significant barrier.

nurettin 121 days ago

Intelligence is about acquiring and utilizing knowledge. Reasoning is about making sense of things. Words are concatenations of letters that form meaning. Inference is tightly coupled with meaning which is coupled with reasoning and thus, intelligence. People are paying for these monthly subscriptions to outsource reasoning, because it works. Half-assedly and with unnerving failure modes, but it works.

What you probably mean is that it is not a mind in the sense that it is not conscious. It won't cringe or be embarrassed like you do, it costs nothing for an LLM to be awkward, it doesn't feel weird, or get bored of you. Its curiosity is a mere autocomplete. But a child will feel all that, and learn all that and be a social animal.

jvidalv 121 days ago

What is the definition for intelligence?

anonymous908213 121 days ago

Quoting an older comment of mine...

  Intelligence is the ability to reason about logic. If 1 + 1 is 2, and 1 + 2 is 3, then 1 + 3 must be 4. This is deterministic, and it is why LLMs are not intelligent and can never be intelligent no matter how much better they get at superficially copying the form of output of intelligence. Probabilistic prediction is inherently incompatible with deterministic deduction. We're years into being told AGI is here (for whatever squirmy value of AGI the hype huckster wants to shill), and yet LLMs, as expected, still cannot do basic arithmetic that a child could do without being special-cased to invoke a tool call.

  Our computer programs execute logic, but cannot reason about it. Reasoning is the ability to dynamically consider constraints we've never seen before and then determine how those constraints would lead to a final conclusion. The rules of mathematics we follow are not programmed into our DNA; we learn them and follow them while our human-programming is actively running. But we can just as easily, at any point, make up new constraints and follow them to new conclusions. What if 1 + 2 is 2 and 1 + 3 is 3? Then we can reason that under these constraints we just made up, 1 + 4 is 4, without ever having been programmed to consider these rules.

coldtea 121 days ago

>Intelligence is the ability to reason about logic. If 1 + 1 is 2, and 1 + 2 is 3, then 1 + 3 must be 4. This is deterministic, and it is why LLMs are not intelligent and can never be intelligent no matter how much better they get at superficially copying the form of output of intelligence.

This is not even wrong.

>Probabilistic prediction is inherently incompatible with deterministic deduction.

And this is just begging the question again.

Probabilistic prediction could very well be how we do deterministic deduction - e.g. about how strong the weights and how hot the probability path for those deduction steps are, so that it's followed every time, even if the overall process is probabilistic.

Probabilistic doesn't mean completely random.
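A small sketch of that last point, with hypothetical logits chosen only for illustration: sampling from a softmax distribution is a probabilistic process, yet when one option dominates strongly enough the sampler picks it on essentially every draw.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, rng):
    """Draw one index at random, weighted by the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
flat = [0.0, 0.0, 0.0]     # no preference: outcomes vary
peaked = [0.0, 30.0, 0.0]  # one option dominates overwhelmingly

flat_draws = [sample(flat, rng) for _ in range(1000)]
peaked_draws = [sample(peaked, rng) for _ in range(1000)]

print(sorted(set(flat_draws)))    # all three options show up
print(sorted(set(peaked_draws)))  # in practice only option 1
```

Both runs use the same sampling procedure; only the shape of the distribution differs.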

famouswaffles 121 days ago

>Intelligence is the ability to reason about logic. If 1 + 1 is 2, and 1 + 2 is 3, then 1 + 3 must be 4.

Human Intelligence is clearly not logic based so I'm not sure why you have such a definition.

>and yet LLMs, as expected, still cannot do basic arithmetic that a child could do without being special-cased to invoke a tool call.

One of the most irritating things about these discussions is proclamations that make it pretty clear you've not used these tools in a while or ever. Really, when was the last time you had LLMs try long multi-digit arithmetic on random numbers ? Because your comment is just wrong.

>What if 1 + 2 is 2 and 1 + 3 is 3? Then we can reason that under these constraints we just made up, 1 + 4 is 4, without ever having been programmed to consider these rules.

Good thing LLMs can handle this just fine I guess.

Your entire comment perfectly encapsulates why symbolic AI failed to go anywhere past the initial years. You have a class of people that really think they know how intelligence works, but build it that way and it fails completely.

coldtea 121 days ago

>The LLM is predicting what tokens come next, based on a bunch of math operations performed over a huge dataset.

Whereas the child does what exactly, in your opinion?

You know the child can just as well be said to "just do chemical and electrical exchanges" right?

jazzyjackson 121 days ago

Okay but chemical and electrical exchanges in a body with a drive to not die is so vastly different than a matrix multiplication routine on a flat plane of silicon

The comparison is therefore annoying

coldtea 121 days ago

>Okay but chemical and electrical exchanges in a body with a drive to not die is so vastly different than a matrix multiplication routine on a flat plane of silicon

I see your "flat plane of silicon" and raise you "a mush of tissue, water, fat, and blood". The substrate being a "mere" dumb soul-less material doesn't say much.

And the idea is that what matters is the processing - not the material it happens on, or the particular way it is.

Air molecules hitting a wall and coming back to us at various intervals are also "vastly different" to a "matrix multiplication routine on a flat plane of silicon".

But a matrix multiplication can nonetheless replicate the air-molecules-hitting-wall audio effect of reverberation on 0s and 1s representing the audio. We can even hook the result to a movable membrane controlled by electricity (what pros call "a speaker") to hear it.
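To make the reverb claim concrete, here is a minimal sketch of echo as a matrix multiplication. The five-sample impulse response is a hypothetical "room", and the helper names are made up for illustration; real reverbs use far longer measured impulse responses and usually FFT-based convolution rather than an explicit matrix.

```python
def convolution_matrix(impulse_response, signal_length):
    """Build the Toeplitz matrix whose product with a signal
    equals the convolution of that signal with the impulse response."""
    rows = signal_length + len(impulse_response) - 1
    matrix = [[0.0] * signal_length for _ in range(rows)]
    for col in range(signal_length):
        for k, h in enumerate(impulse_response):
            matrix[col + k][col] = h
    return matrix

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical "room": the direct sound plus two quieter echoes.
impulse_response = [1.0, 0.0, 0.5, 0.0, 0.25]
dry = [1.0, 0.0, 0.0]  # a single click followed by silence

wet = matvec(convolution_matrix(impulse_response, len(dry)), dry)
print(wet)  # [1.0, 0.0, 0.5, 0.0, 0.25, 0.0, 0.0]
```

The output is the click followed by its echoes, produced by nothing more than multiplying numbers and adding them up.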

The inability to see that the point of the comparison is that an algorithmic modelling of a physical (or biological, same thing) process can still replicate, even if much simpler, some of its qualities in a different domain (0s and 1s in silicon and electric signals vs some material molecules interacting) is therefore annoying.

JoshTriplett 121 days ago

Intelligence does not require "chemical and electrical exchanges in a body". Are you attempting to axiomatically claim that only biological beings can be intelligent (in which case, that's not a useful definition for the purposes of this discussion)? If not, then that's a red herring.

"Annoying" does not mean "false".

anonymous908213 121 days ago

At least read the other replies that pre-emptively refuted this drivel before spamming it.

coldtea 121 days ago

At least don't be rude. They refuted nothing of the sort. Just banged the same circular logic drum.