| > Any evidence in this area? See danShumway's post below. People are regularly posting exploits on twitter, including getting the system to dump it's prompt. May I ask politely, are you a programmer, and have you secured system's previously? It will change the way I approach trying to carry my message across. For background, a finished LLM is a blackbox. You can't program the LLM in the box in the traditional sense, because we don't fully understand what happens in the box at a level where we can "code" it. Judging the security of a filter by the cases where it works is a very bad way to judge security. Blocklists ARE NOT SAFE because it is impossible to account for the infinite variety of things that can be tried. Here's a whitepaper on the difficulties. There's been lots of writing about this: https://research.nccgroup.com/wp-content/uploads/2020/07/ncc... Now, this has been shown to be difficult for really constrained scenarios, like SQL and so forth, but English has a million words, for starters. |