by 0xy
4 days ago
Fable specifically refused to harden the security of codebases. If you use misdirection to force Fable to do just that, that's the removal of a guardrail. Anthropic specifically stated that ANY security requests should be shunted to Opus 4.8. This was bypassable. I don't see what your confusion here is. Fable was prevented from working on any security tasks. A significant number of people, myself included, witnessed Fable refusing to harden code as a result. Bypassing that is a bypass of guardrails. Your assertion that working on security is not working on security because you used misdirection is, of course, preposterous. You wouldn't be making the same claim if Fable refused to work on chemical weapons research but happily proceeded to do so if you claimed it was for eradicating pests.

Asking a model to fix bugs is neither misdirection nor a security request.
> I don't see what your confusion here is.
That's because I'm not confused :-)
> Fable was prevented from working on any security tasks
I don't think that's true based on what Anthropic said, and I also don't think it can be true.
What do you propose Fable should do if you ask it to fix bugs and it encounters a security issue? I'm assuming your solution is that when you ask Fable to "fix bugs" and it encounters a bug that could be exploited as a security vulnerability, it should fall back to 4.8. But that doesn't solve the problem: as a user, I can see where the fallback occurred, so I still know where the vulnerability is. That's not substantially different from the current outcome, where it just fixes the bug.
It would also mean that Fable could barely make it through any code review without falling back to 4.8, because almost any non-trivial codebase has aspects that could be interpreted as security vulnerabilities.
The alternative would be for the model to use its hidden thinking to decide not to fix the bug, but that seems even worse.