HN Mirror



	by Sol- 434 days ago
	I guess the fear is that normal, innocent-sounding goals you might later give it in real-world use could elicit behavior like that even without such explicit prompting. I think this is a demonstration that it has sufficient capabilities and can get the "motivation" to engage in blackmail. At the very least, there will always be malicious actors who use these models for blackmail.


holmesworcester 434 days ago

It is also well-established that models internalize values, preferences, and drives from their training. So the model will have some default preferences independent of what you tell it to be. AI coding agents have a strong drive to make tests green, and anyone who has used these tools has seen them cheat to achieve green tests.

Future AI research agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.


cebert 434 days ago

> AI coding agents have a strong drive to make tests green, and anyone who has used these tools has seen them cheat to achieve green tests.

As long as you hit an arbitrary branch coverage %, a lot of MBAs will be happy. No one said the tests have to provide value.


cortesoft 434 days ago

I've seen a lot of humans cheat for green tests, too.


whodatbo1 434 days ago

benchmaxing is the expectation ;)
