by hm-nah 784 days ago
A jailbreak doesn’t “make a model do something actually bad”.

A jailbreak makes it trivial to “provide a human who wishes to do bad, the info needed to be successful”.

Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.

This ain’t a joke.

3 comments

> This ain’t a joke.

Yes it is. Libraries and the internet have made finding "harmful" instructions trivial for decades, if not centuries.

There’s a difference between “finding dangerous info” in a public space (library) or via a mostly auditable space (the internet) and having “a friendly assistant to help you make a real mess of society” on an airgapped computer.

I'm not buying it. It's just hysteria. Evil doesn't come from opportunity. If it did, we would have far higher rates of mayhem than we do. Read a 1950s chemistry book or murder mystery. Or, <shudder> a 1980s spy movie. Information does not move the needle.

I'm pretty sure it's far easier to audit people downloading LLMs capable of providing such coherent instructions than it is to audit all uses of search that could produce the same instructions (esp. since the query could be very oblique).

In any case, based on the experience with LLMs so far, you cannot meaningfully censor them in this way without restricting access to the weights. Any "guardrails" are finetuned into them, and can just as easily be finetuned out.

For argument's sake, I'll agree.

Now, this information is taught at a higher level and to a much greater depth in colleges. And they don't just teach you about the dangerous stuff, they even give you direct access to the laboratories and chemicals! Thus, any chemical engineer would have the education, expertise, and placement to access a municipal water supply to poison a city, if they so chose.

In the spirit of maximizing harm reduction, what should colleges do to ensure that no one who attends becomes capable of harming others?

Because it’s open source, neither Meta nor other SOTA makers can “recall” the model. How many more chances will we get to get this right?
Model training will continue until morale improves.