|
|
|
|
|
by lukewarm707
63 days ago
|
|
you can use a safety model trained on prompt injections with developer message priority. user message becomes close to untrusted compared to dev prompt. also post train it only outputs things like safe/unsafe so you are relatively deterministic on injection or no injection. ie llama prompt guard, oss 120 safeguard. |
|
[0] https://www.hiddenlayer.com/research/same-model-different-ha...