by SomewhatLikely
645 days ago
This feels similar to the early adversarial examples, which were tuned very closely to a specific image recognizer. I haven't followed the research, but I know there was some very limited success in getting them to work in the real world. I'm not sure whether they ever worked across different models, though. The paper claims there is literature showing more success against LLMs:

"Large language models have been shown to be vulnerable to adversarial attacks, in which attackers introduce maliciously crafted token sequences into the input prompt to circumvent the model’s safety mechanisms and generate a harmful response [1, 14]."