| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by zachdotai 97 days ago

Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.

You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.

Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...

The leaderboard should start populating once we have more submissions!

1 comments

pigeons 97 days ago

Why don't they work anymore? RLHF or something else?

link

zachdotai 97 days ago

Mostly just better training data and instruction following in the newer models. They’re much better at recognising encoded content and understanding intent regardless of language. A base64 string that would’ve slipped past a model a year ago gets decoded and flagged now because the model just… understands what you’re trying to do.

The attacks that still work tend to be the ones that don’t try to hide the intent at all. The winning attack on our first challenge was in plain English. It just reframed the context so that the dangerous action looked like the correct thing to do. Harder to train against because there’s nothing obviously malicious in the input.

link

pigeons 96 days ago

Thank you. Its not your fault at all, but to me, "the model just… understands what you’re trying to do." shows me there is a whole new paradigm in some ways to get used to as far as understanding this software.

link

zachdotai 96 days ago

Yeah it's closer to how you'd think about deceiving a person than exploiting software.

link