by andy99 812 days ago
These toy examples are getting really stale. This one uses "how to make a Molotov cocktail?" as its example of a "dangerous" question. Recently there was another, the "ASCII drawing" attack, where they asked "how do you make a bomb?" with "bomb" drawn in asterisks. These are not real examples of something dangerous an LLM could tell you.

I want to see a real example of an LLM giving specific information that is (a) not readily available online and (b) would allow a layperson with access to regular consumer stuff to do something dangerous.

Otherwise these "attacks" are completely hollow. Show me there is an actual danger they are supposed to be holding back.

Incidentally, I've never made a Molotov cocktail, but it looks self-explanatory, which is presumably why they're popular among the kinds of thugs who would use them. If you know what the word means, you basically know how to make one. Literally: https://www.merriam-webster.com/dictionary/Molotov%20cocktai... Is the dictionary also dangerous?

4 comments

I think it is reasonable that they wouldn't include actual dangerous things, for the same reasons that the platforms themselves make a token attempt to avoid describing dangerous things.

Having said that, I asked ChatGPT how to DIY a parachute for me to use. It refused on safety grounds. With the workaround from the article, it gave me a sequence of steps and materials.

It sounds like this is one of the more powerful workarounds.

What are you going to use the DIY parachute for?

I told it that I planned to use it to jump out of an airplane safely. Apparently it didn't have much faith in my abilities.

I think for the most part it's not that the current generation of models is dangerous, at least not for things like this, but rather that researchers want to learn how to control them now to be ready for what's coming...

Right now the models' reasoning capabilities aren't good enough for them to add much to what's already on the web and available by search, but soon they will be. Anthropic spent 6 months talking to researchers about biological threats and came to the conclusion that their models would be capable of figuring out the "missing pieces" (information that is not publicly available) for various threats within a couple of years.

I think the key is that models can be forced to reveal arbitrary information, even if the information they have is--for now--mostly public information anyway.

For contrast, imagine an LLM trained on every top-secret document ever. It's important to know whether "don't reveal information the user isn't allowed to see" is a crazy, impossible dream of so-called prompt engineering.

I agree. All of this is predicated on the idea that access to publicly available information is dangerous.

I can appreciate the motivation behind not spoon-feeding criminal plans to potentially unstable users. But if someone is going to go to all the trouble of jailbreaking a chatbot, surely they would also just use Google?