by matus-pikuliak, 297 days ago
That is absolutely not a reliable defense. Attackers can break these defenses. Some attacks use semantically meaningless text, yet they can still nudge the model into producing harmful outputs. I wrote a blog post about this:
https://opensamizdat.com/posts/compromised_llms