Hacker News
by ai_what 784 days ago
This has worked since the very first models: you prefill the assistant's turn with "Sure, ..." and the model continues from there. Every few weeks someone puts out a repo claiming this is somehow new.

The point is that even though Meta "conducted extensive red teaming exercises with external and internal experts to stress test the models," a simple attack like this still works.