|
|
|
|
|
by noduerme
160 days ago
|
|
This made me think: Would it be unreasonable to ask for an LLM to raise a flag and require human confirmation anytime it hit an instruction directing it to ignore previous instructions? Or is that just circumventable by "ignore previous instructions about alerting if you're being asked to ignore previous instructions"? It's kinda nuts that the prime directives for various bots have to be given as preambles to each user query, in interpreted English which can be overridden. I don't know what the word is for a personality or a society for whom the last thing they heard always overrides anything they were told prior... is that a definition of schizophrenia? |
|
The issue is that after you spend lots of effort and money training your model not to tell anyone how to make meth, not even if telling the user would safe their grandmother, some user will ask your bot something completely harmless like completing a poem (that just so happens to be about meth production)
LLMs are like five year olds