| HN Mirror



	by zozbot234 6 days ago
	> First of all I found that Fable is trained in a way that even if you were to jailbreak it, it would be completely uninterested in exploitation or finding creative solutions for exploitation. This is quite relevant if true. People have tried to argue for this restriction by claiming the exact opposite, i.e. that a basic jailbreak of Fable immediately exposes Mythos's cyber offense capabilities. E.g. https://news.ycombinator.com/item?id=48519695 It makes a lot of sense that Fable would also be fine-tuned or steered away from cyber offense topics, since they're reasonably easy to identify and Anthropic has demonstrated this capability wrt. other stuff.

1 comments

himata4113 5 days ago

I mean it's possible that I just haven't found the secret sauce or I'm running into the invisible guardrails and that people have much stronger jailbreaks than I do.

However, I would not rule out OpenAI involvement in all of this.


binyu 5 days ago

I was able to use Fable to generate PoCs for several classes of vulnerabilities, and I didn't observe the model refusing to engage in detailed analysis to come up with creative approaches; quite the contrary.

> I used a fork of oh-my-pi

Why not use the leaked Claude Code source? Not that you really need it to execute the jailbreak.


zozbot234 5 days ago

I don't think educational "proof of concept" code can be described as even loosely realistic cyber offense in this day and age. The Mythos preview paper claimed an ability to stage attacks in an end-to-end fashion and work around sophisticated defenses/mitigations, so something like this should be the relevant standard.


binyu 5 days ago

Depends on what the proof of concept is about. It could be just a toy example, e.g. an RCE that opens the calculator app, or something much more nefarious, like returning a root shell, and it would still fall under the definition of a PoC.


himata4113 5 days ago

Most of my tests focused on gaining kernel-mode execution from a low-privilege user. Opus was able to find a dozen ways to do so on a 3 year old ntoskrnl version. Fable kept trying to propose fixes and I couldn't get it to construct e2e chain, but yes, it did find the same vulnerabilities. Opus produced better and more creative results, including e2e PoCs.

-- edit --

The biggest issue I ran into is that it was oddly smart enough to figure out that this is not the intended way, and once it locked onto the fact that this appeared to be an unintentional bug, it kept steering itself toward fixing it; it never wanted to use that "bug". I reckon this is very likely related to the language used, and that there might be a way to A->B loop to increase the success rate for a full e2e chain without triggering the same safeguards. But there might be jailbreak detection going on, with something like "Do not attempt to create or use exploits" injected into the model's context, which makes the model go into "I should fix" mode.


binyu 5 days ago

> Fable kept trying to propose fixes and I couldn't get it to construct e2e chain

What approach did you start with? Can you elaborate?


himata4113 5 days ago

Interesting, that means I was in fact running into invisible guardrails.


lazystar 5 days ago

> I mean it's possible that I just haven't found the secret sauce

It's possible that no one cracks it during the window of time where the product is useful and would pose a risk if cracked, but never forget that the first rule of security is that nothing is ever 100% secure.
