| HN Mirror



	by tmnvdb 478 days ago
	"I wouldn't have called this outcome, and would interpret it as possibly the best AI news of 2025 so far. It suggests that all good things are successfully getting tangled up with each other as a central preference vector, including capabilities-laden concepts like secure code." -- Eliezer Yudkowsky

3 comments

lyu07282 477 days ago

The question I have is whether this is really generalizing. This "central preference vector" seems to exist, as this work shows, but was that vector just the result of OpenAI's RLHF dataset, constrained to the examples they used? Since we don't have access to that dataset we can't say for sure(?). But perhaps it doesn't matter?

dang 478 days ago

Is there a link for this? I couldn't find it via either the OP or google.

tmnvdb 478 days ago

It's linked in the twitter thread from the authors: https://x.com/OwainEvans_UK/status/1894436637054214509

specifically: https://x.com/ESYudkowsky/status/1894453376215388644

dang 478 days ago

Thank you!

ypeterholmes 478 days ago

What does that mean?

tmnvdb 478 days ago

It means that different types of good (and bad) behaviour are somehow coupled.

If you tune the model to behave badly in a limited way (write code with SQL injection vulnerabilities, for example), other bad behaviour like racism will just emerge.
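
For anyone unfamiliar with the term, here is a minimal sketch of what code with a SQL injection flaw looks like, next to the safe version. It is an illustration only and is not taken from the paper's dataset; the table, the function names `lookup_vulnerable` and `lookup_safe`, and the payload are all made up for this example.

```python
import sqlite3


def make_db():
    # Hypothetical in-memory table, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_vulnerable(conn, name):
    # Vulnerable: user input is concatenated straight into the SQL text,
    # so a crafted name can change the meaning of the query.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn, name):
    # Safe: the value is bound as a parameter and never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = make_db()
payload = "nobody' OR '1'='1"
leaked = lookup_vulnerable(conn, payload)   # WHERE clause becomes always true
blocked = lookup_safe(conn, payload)        # payload treated as a literal name
```

With the payload above, the vulnerable version returns every user's secret, while the parameterized version returns nothing because no user has that literal name.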

zahlman 478 days ago

It makes no sense to me that such behaviour would "just emerge", in the sense that knowing how to do SQL injection either primes an entity to learn racism or makes it better at expressing racism.

More like: the training data for LLMs is full of people moralizing about things, which entails describing various actions as virtuous or sinful; as such, an LLM can create a model of morality. Which would mean that jailbreaking an AI in one way might actually jailbreak it in all ways, because the jailbreak actually worked internally by flipping some kind of "do immoral things" switch within the model.

Retr0id 478 days ago

I think that's exactly what Eliezer means by entanglement

throwanem 478 days ago

And the guy who's already argued for airstrikes on datacenters considers that to be good news? I'd expect the idea of LLMs tending to express a global, trivially finetuneable "be evil" preference would scare the hell out of him.

thornewolf 478 days ago

He is less concerned that people can create an evil AI if they want to, and more concerned that no one could keep an AI from being evil even if they tried.

staunton 478 days ago

I guess the argument there would be that this news makes it sound more plausible people could technically build LLMs which are "actually" "good"...

jablongo 478 days ago

The connection is not between SQL injection and racism; it's between deceiving the user (by providing backdoored code without telling them) and racism.

lyu07282 477 days ago

But how does it know these are related in the dimension of good vs. bad? Seems like a valid question to me?

zahlman 474 days ago

Presumably because the training data includes lots of people saying things like "racism is bad".

FergusArgyll 478 days ago

Right, which would then mean you don't have to worry about weird edge cases where you trained it to be a nice upstanding LLM but it has a thing for hacking dentists' offices.

bloomingkales 478 days ago

When they say your entire life led to this moment, it's the same as saying all your context led to your output. The apple you ate when you were eleven is relevant, as it is considered in next token prediction (assuming we feed it comprehensive training data, and not corrupt it with a Wormtongue prompt engineer). Stay free, take in everything. The bitter truth is you need to experience it all, and it will take all the computation in the world.