Hacker News
by matus-pikuliak 297 days ago
That is absolutely not a reliable defense. Attackers can break these defenses. Some attacks use semantically meaningless inputs that can still nudge the model to produce harmful outputs. I wrote a blog post about this:

https://opensamizdat.com/posts/compromised_llms