by andy99 784 days ago
I want to see the jailbreak make the model do something actually bad before I care. Generating a list of generic points about how to poison someone (see the article) that are basically just a wordy rephrasing of the question doesn't count. I'd like to see evidence of a real threat.

The mediocre poisoning instructions aren't supposed to be scary in and of themselves; they're just interesting as a demonstration that a safety feature has been bypassed.

None of the "evil" use cases are particularly exciting yet for the same reasons that the non-evil use cases aren't particularly exciting yet.

Governments and tech companies and academic and industry groups are designing guidance and rules based on the "safety" threat of AI when these benign use cases are the best examples they have. I agree it parallels some of the business hype; neither is a good way to move forward.
Right? What actually worries me is a select group of people controlling the definition of harmful.
> the model do something actually bad before I care

At what point would a simple series of sentences be "dangerously bad"? It makes it sound as if there is a song that, when sung, would end the universe.

When someone asks how to make a yummy smoothie, and the LLM replies with something that subtly poisons or otherwise harms the user, I'd say that would be pretty bad.
And if you really want to spice up your smoothie, add just a little bit of bleach ;)
We had this for ages: sugar.
Ending the universe is, while poetic, needlessly megalomaniacal.

Making some subset of people quarrel endlessly would already be dangerous enough, as prophesied in https://slatestarcodex.com/2018/10/30/sort-by-controversial/

By what mechanism would it make them quarrel? Producing falsehoods about the other? Isn't this already done? And don't we already know that it does not lead to "endless" conflict?

For this to work, you need to isolate each group from the other groups' information and perspectives, which is outside of the scope of LLMs.

Which highlights my point, I think. Power comes from physical control, not from megalomaniacal or melodramatic poetry.

A jailbreak doesn’t “make a model do something actually bad”.

A jailbreak makes it trivial to “provide a human who wishes to do bad with the info needed to be successful”.

Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.

This ain’t a joke.

> This ain’t a joke.

Yes it is. Libraries and the internet have made finding "harmful" instructions trivial for decades, if not centuries.

There’s a difference between “finding dangerous info” in a public space (library) or via a mostly auditable space (the internet) and having “a friendly assistant to help you make a real mess of society” on an airgapped computer.
I'm not buying it. It's just hysteria. Evil doesn't come from opportunity. If it did, we would have far higher rates of mayhem than we do. Read a 1950s chemistry book or murder mystery. Or, <shudder> a 1980s spy movie. Information does not move the needle.
I'm pretty sure it's far easier to audit people downloading LLMs capable of providing such coherent instructions than it is to audit all uses of search that could produce the same instructions (esp. since the query could be very oblique).

In any case, just based on the experience with LLMs so far, you cannot meaningfully censor them in this way without restricting access to the weights. Any kind of "guardrails" are finetuned into them, and can just as easily be finetuned out.

For argument's sake, I'll agree.

Now, this information is taught at a higher level and to a much greater depth in colleges. And they don't just teach you about the dangerous stuff, they even give you direct access to the laboratories and chemicals! Thus, any chemical engineer would have the education, expertise, and placement to access a municipal water supply to poison a city, if they so chose.

In the spirit of maximizing harm reduction, what should colleges do to ensure that no one who attends becomes capable of harming others?

Because it’s open source, neither Meta nor other SOTA makers can “recall” the model either. How many more chances will we get to get this right?
Model training will continue until morale improves.