HN Mirror


by SilverElfin 10 days ago

> In their blog post, Anthropic defended its decision by saying the jailbreak isn’t serious. That is not what the trusted partner and the USG believe; nor is that kind of minimizing language consistent with Anthropic’s brand as the AI safety company.

This is what makes me feel Sacks is speaking the truth here, despite my generally not trusting him (due to MAGA sycophancy) or this administration in general. Given Anthropic and Dario in particular are so alarmist about safety, even a small jailbreak should cause them to pull back and fix it first, right? Didn’t they say Mythos is very dangerous in the wrong hands? How can you take any chances if that’s the case? It’s just not consistent to minimize things - and I feel they probably didn’t want to admit to the world that their own safeguards aren’t good enough either, because that would harm their business if they had to pull back their models and stick to their safety views. But if they didn’t pull these models, they would be admitting that safety is theater for regulatory capture and that it doesn’t really matter.

By minimizing the jailbreak, they’re trying to have it both ways, and that feels dishonest. Any mature executive would also know how this would look. Even if they were correct about the jailbreak, Anthropic should know how all of this looks to the outside world. The fact that Dario doesn’t shows they shouldn’t be the ones in charge of this capability. Both OpenAI and Anthropic need brand new professional leadership.

As for Amazon - they aren’t going to raise alarms randomly. AWS is very trustworthy and well run. I believe their security researchers were genuinely worried about the impact of the general public effectively getting Mythos access through jailbreaks on Fable. Plus they might be the largest owner of Anthropic equity outside of Anthropic itself.

1 comment

thrill 9 days ago

"even a small jailbreak should cause them to pull back and fix it first, right"

You do realize that an LLM's behavior is the product of a vast number of weights, don't you? You don't "fix" a weight and suddenly everything is alright. You can only probe constantly in a vast space and see whether the results you can elicit matter or not.
