Hacker News new | ask | show | jobs
by comboy 67 days ago
From my experience, saying "this is not X, it will be not used for Y" is vastly increasing chances of this being classified as being X. Anybody can write "this is authorized research". Instead use something like evaluate security / verify security, make sure this cannot be (...), etc.

Of course these models are pretty smart so even Anthropic's simple instructions not to provide any exploits stick better and better.