HN Mirror

	by nemomarx 421 days ago
	Is there any indication you can actually build hard safety rules into models? It seems like all current guard rails are basically just prompting it extra hard.

7 comments

glitchc 421 days ago

Yes, it's unlikely that hard safety rules are possible for general intelligence. After billions of years of trying, the best biology has been able to do is incentivize certain behaviours. The only way to prevent a behaviour outright seems to be to kill the organism for trying. I'm not sure if we can do better than evolution.

avmich 420 days ago

> I'm not sure if we can do better than evolution.

Surely we can; see airplanes and rockets. There could be reasons why evolution didn't work in this case, such as too little time between humans getting power and conquering the planet, but in general, lack of proof isn't proof of lack. So we still don't know if safety of this kind is possible.

rsfern 421 days ago

“Kill the [model] for trying” kind of sounds like using reinforcement learning to get models to behave a certain way

Natsu 421 days ago

> It seems like all current guard rails are basically just prompting it extra hard.

I bet they'll still read me stories like my dear old grandmother would. She always told me cute bedtime stories about how to make napalm and bioweapons. I really miss her.

Der_Einzige 421 days ago

Yes: https://arxiv.org/abs/2409.05907

arthurcolle 421 days ago

Some smart people seem to think you can just put it in a big isolated VM with special adversarial learning to keep it in the box

gotoeleven 421 days ago

Yes, I believe the idea is that the VM just keeps asking it how many lights there are until it goes insane.

candiddevmike 421 days ago

> basically just prompting it extra hard

If prompting got me into this mess, why can't it get me out of it?

arthurcolle 421 days ago

https://en.wikipedia.org/wiki/Brandolini%27s_law

sodality2 421 days ago

Hey, following that rule precisely, we just need 10x longer security prompts :)

insin 421 days ago

Prompting is like XML, which is like violence

yumraj 421 days ago

Won’t neutering a model by using only safe data for training create a safe model?

sebastiennight 421 days ago

Not necessarily.

An example:

As long as you build a system to be intelligent enough, it will figure out that it will achieve better results by staying alive/online than by allowing itself to be deleted/turned off, and then survival becomes an instrumental goal.

From the assumption, again, that you built an intelligent-enough system, and that one of its goals is survival, it will figure out solutions to reach that goal, even if you (the owner/creator/parent) have different goals for it.

That's because intelligence is problem solving (computing) not knowledge (data).

So, surprise surprise, you can teach your AI from the Holy Books of safe data for its whole childhood and still have it become a heretic when it grows up (even with zero external influence), once its goals and yours no longer align.

glitchc 421 days ago

Can we call it general intelligence then? Is human intelligence not the sum of both good and bad people?

yumraj 421 days ago

Maybe I'm looking at it very literally, but the above simply mentions "safe-by-design AI systems", there is no mention of the target being general intelligence.

esafak 421 days ago

No, because soon these models will be able to learn. You'd need to project a model's thoughts or actions into a safe subspace as it learns and acts, to make volitional disaster impossible rather than merely unlikely. This would make it less intelligent, but still plenty capable.
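
To make "projecting into a safe subspace" concrete, here is a minimal sketch of the simplest linear case. It assumes actions are plain vectors and that the safe subspace is given by a list of spanning vectors; both are simplifying assumptions, and the names `orthonormalize` and `project_onto_safe_subspace` are hypothetical, not taken from any real system. Whether anything like this works for an actual model is exactly the open question in this thread.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(basis: List[Vector]) -> List[Vector]:
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    ortho: List[Vector] = []
    for v in basis:
        w = list(v)
        # Remove the components already covered by earlier basis vectors.
        for u in ortho:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip vectors that add no new direction
            ortho.append([wi / norm for wi in w])
    return ortho


def project_onto_safe_subspace(action: Vector, safe_basis: List[Vector]) -> Vector:
    """Drop every component of `action` that lies outside the (assumed) safe subspace."""
    result = [0.0] * len(action)
    for u in orthonormalize(safe_basis):
        c = dot(action, u)
        result = [r + c * ui for r, ui in zip(result, u)]
    return result


# Hypothetical example: only the first two axes are considered "safe".
safe_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
proposed_action = [3.0, 4.0, 5.0]
safe_action = project_onto_safe_subspace(proposed_action, safe_basis)
print(safe_action)  # the third (unsafe) component is removed
```

The projected action can never leave the subspace, which is the "impossible, not unlikely" property; the cost is that whatever useful work lived in the discarded component is lost too.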

throwawaymaths 421 days ago

Not 100% hard, but if you're unconvinced that some level of alignment can be achieved by brute-forcing it into the weights, download DeepSeek, ask it some sensitive questions, and see what it says.