HN Mirror

by chasd00 118 days ago

"safe" is such a subjective concept to begin with, have any of the model providers ever defined what they mean by "safe"?

It doesn't mean much to me if a safe model is one that does not output the recipe for mustard gas; that information is trivially available elsewhere.

Or is a safe model one that doesn't come off as racist? OK, but I would classify that as inoffensive rather than safe, though I admit definitions of words can be fluid and change.

Is a safe model one that refuses to produce code for a weapons system? Well... does a PID controller count? I can use that to keep a gun pointed at a target or I can use that to prevent a baby rocker from falling over.
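
To make the PID point concrete, here is a minimal discrete PID controller sketch in Python. The `PID` class, the `simulate` helper, the gains, and the toy plant (where the control output directly sets the rate of change) are hypothetical choices made only for illustration. The controller sees nothing but a setpoint and a measurement; nothing in it knows whether that number is a gun's aim angle or a rocker's tilt.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PID:
    """Minimal discrete PID controller; the gains used below are illustrative, not tuned for any real system."""

    kp: float
    ki: float
    kd: float
    setpoint: float = 0.0
    _integral: float = field(default=0.0, init=False)
    _prev_error: Optional[float] = field(default=None, init=False)

    def update(self, measurement: float, dt: float) -> float:
        """Return the control output for one time step of length dt."""
        error = self.setpoint - measurement
        self._integral += error * dt
        # On the very first sample there is no previous error, so the derivative term is zero.
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def simulate(controller: PID, start: float, steps: int = 2000, dt: float = 0.01) -> float:
    """Run a hypothetical toy plant (x' = u, Euler-integrated) under the controller; return the final x."""
    x = start
    for _ in range(steps):
        u = controller.update(x, dt)
        x += u * dt
    return x


if __name__ == "__main__":
    # Same code, same gains: one instance "aims" at 30 degrees, the other steadies a tilt back to 0.
    aim = PID(kp=2.0, ki=1.0, kd=0.05, setpoint=30.0)
    rocker = PID(kp=2.0, ki=1.0, kd=0.05, setpoint=0.0)
    print(f"aim angle after 20 s: {simulate(aim, start=0.0):.4f}")
    print(f"rocker tilt after 20 s: {simulate(rocker, start=5.0):.4f}")
```

The gains are arbitrary. The only point is that the control law is indifferent to what the measurement represents, which is why a rule like "refuse weapons code" is hard to apply at the level of a PID loop.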

Maybe they're giving up on "safe" because there's no definitive way to know if a model is safe or not. I've always held the opinion that AI safety was more about brand safety. Maybe now the model providers can afford some bad press without it being the death of their company.

8 comments

wongarsu 118 days ago

My preferred version of "safe" is: in its actions, the model considers and mostly upholds usually unstated constraints like "don't kill unless necessary", "keep Earth inhabitable", "avoid toppling society unless really well justified for the greater good", etc. That kind of framing was prevalent pre-ChatGPT. It's not terribly relevant for chat software, but it becomes increasingly important as chat models turn into agents.

Of course, once you have that framing, additional goals like "don't give people psychosis", "don't give step-by-step instructions on making explosives, even if Wikipedia already tells you how to do it", or "don't harm our company's reputation by being racist" are conceptually similar.

On the other hand, "don't make weapon systems" or "never harm anyone" might not be viable goals, not only because they are difficult or impossible to define, but also because there is huge financial and political pressure not to limit your AI in that way (see Anthropic).

pjc50 118 days ago

> I can use that to keep a gun pointed at a target or I can use that to prevent a baby rocker from falling over.

This leads to what I'm going to call the "Ender's Game" approach: if your AI is uncooperative, just present it with a simulation that it does like but which maps onto real-world control it objects to.

> I've always held the opinion that ai safety was more about brand safety

Yes. The social media era made that very important. The extent to which brand safety is linked to actual physical safety then becomes a question of how well you can manage the publicity around disasters. And they're doing a pretty good job of denying responsibility.

LordHumungous 118 days ago

What if I tell the model to commit fraud or other crimes and it complies? What if users are having psychotic episodes driven by their interactions with the model?

Just because safety is a hard and messy problem doesn't mean we should just wash our hands of it.

ryandrake 118 days ago

It is a hard and messy problem, and it doesn't help when people muddy the water further by stirring things like "Don't commit fraud," "Don't infringe on Disney's trademark," and "Don't be racist" into the mix and trying to lump them all under the "Safety" umbrella.

Maybe this is an outdated definition, but I've always thought of safety as being about preventing injury: safety glasses and hardhats on the work site, warnings about slippery floors, and so on. I think people are trying to stretch the word to cover a great many more things in the context of AI, which makes it harder to focus on the core problem.

I think we need a different, clearer word for "The AI output shouldn't contain certain unauthorized things."

Aperocky 118 days ago

The messier a problem is, the less it should be decoupled and siloed into its own team.

Instead of making actual improvements on the subject (safety, security, you name it), it becomes a checkbox exercise, and the metrics and bureaucracy become increasingly decoupled from the truth.

miltonlost 118 days ago

But think about how much money there is to be made by just ignoring it all!

some_random 118 days ago

> Is a safe model one that refuses to produce code for a weapons system? Well.. does a PID controller count? I can use that to keep a gun pointed at a target or I can use that to prevent a baby rocker from falling over.

I've been using LLMs for some cyber-y tasks, and this is exactly how it ends up going. With some models you can't ask "hack this IP", but for more discrete tasks they have no such qualms.

bluecheese452 118 days ago

Those are some really interesting questions. To me, giving a mustard gas recipe to someone with no intent to use it is unlikely to be dangerous. On the other hand, particularly inflammatory racial propaganda in an area with simmering ethnic tensions is very likely to be dangerous.

But give that same recipe to a wannabe terrorist and suddenly it is dangerous. Context matters, not just the information.

0_____0 118 days ago

I think the problem of chatbot "safety" mirrors that of autonomous vehicle safety. For an AV, the correct course of action is one that avoids hitting stuff (including people, vehicles) and, critically, minimizes liability.

Davidzheng 118 days ago

Well, I do think there's some degree of unsafeness that is inextricably linked to capability: for example, if the model, when deployed with full control of a machine, is capable of large-scale cyberattacks and blackmail.

justonceokay 118 days ago

> because there's no definitive way to know if a model is safe or not.

The only answer is that there's no money in it being safe. It is not an epistemic problem.
