

by InsideOutSanta 4 days ago

> If you use misdirection to force Fable to do just that, that's the removal of a guardrail.

Asking a model to fix bugs is neither misdirection nor a security request.

> I don't see what your confusion here is.

That's because I'm not confused :-)

> Fable was prevented from working on any security tasks

Based on what Anthropic said, I don't think that's true, and I also don't think it can be true.

What do you propose Fable's behavior should be if you ask it to fix bugs and it encounters a security issue? I'm assuming your solution is that when you ask Fable to "fix bugs" and it encounters a bug that could be exploited as a security vulnerability, it should fall back to 4.8. But that doesn't solve the problem: as a user, I can see where the fallback occurred, so I still know where the vulnerability is. That's not substantially different from the current outcome, where it just fixes the bug.

It would also mean that Fable could barely make it through any code review without falling back to 4.8, because almost any non-trivial codebase contains code that could be interpreted as a security vulnerability.

The alternative would be for the model to use its hidden thinking to decide not to fix the bug, but that seems even worse.