Hacker News new | ask | show | jobs
by gpm 784 days ago
As I see it the purpose of safety training is to make it so that if I run a service where I return model outputs to innocent users it's not going to say things that will get me in trouble (swear at them, recommend they commit a crime, and so on). This is important if you want to run a user facing model and your reputation depends on what it says.

That threat model includes the user putting nonsense in the "user" turn of the model. It doesn't include the user putting things in the "assistant" turn of the model, that's not something a responsible/normal UI exposes. So... this quote-unquote attack seems uninteresting. It's like getting root access by executing a suid binary that you set up on the system as root.

2 comments

But we must disallow this too, because it allows the (advanced) user to have fun, and as I understand these safety measures, having fun is strictly prohibited. Using the model is allowed for boring things only.
True, this could be a nice layer of protection for the runner of such a service, but the point of LLAMA safety is to protect Meta.

For an open weights model, model users can trivially put text in the assistant side.

The point is that these open weight models can be run secretly to assist criminal enterprises, whereas models behind an API can be intercepted and reported to the authorities. So it would be really nice if Meta could lock them down before releasing them so that the total net good done by the model is maximized. But apparently that is not possible.

Personally I’m pretty libertarian on AI governance, but I’m just giving what I understand to be the purpose of the kind of “safety” feature defeated here.

All sorts of technology can be used secretly to assist criminal enterprises. Cars, computers, pencils, electricity, etc. It's unfair to hold LLMs to a higher standard than what applies to nearly everything else.