Hacker News
by davesque 62 days ago
> We stated that we would keep Claude Mythos Preview’s release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

It feels like this is a losing strategy. Claude should be developing secure software and also properly advising on how to do so. The goals of censoring cyber security knowledge and also enabling the development of secure software are fundamentally in conflict. Also, unless all AI vendors take this approach, it's not going to have much of an effect in the world in general. Seems pretty naive of them to see this as a viable strategy. I think they're going to have to give up on this eventually.

9 comments

The fundamental tension is that the models are getting weirdly good at hacking while still sort of sucking at a bunch of economically valuable tasks.

So they've hit the point where the models are simultaneously too smart (dangerous hacking abilities) and too stupid (can't actually replace most employees). So at this point they need to make the models bigger, but they're already too big.

So the only thing left to do is to make them selectively stupider. I didn't think that would be possible, but it seems like they're already working on that.

> models are getting weirdly good at hacking while still sort of sucking at a bunch of economically valuable tasks

like most human hackers

Honestly I feel sometimes like about the only thing they do successfully is hacking. Not just in the sense of breaking into systems that are assumed to be secure, although also in that sense. They're just highly effective at fumbling around with a hatchet until something works. We just happen to have version control and automated testing that generally make that approach somewhat viable for the task of programming. But while I've been genuinely impressed at how much it can put features into a workable state, I've never been confident looking at its output that it's going to do more than POC quality at the current state of things. But it's pretty dang effective at that given enough time and a space safe to hack away and reset until the product looks close enough.
"Genius is but the capacity to take infinite pains."
You know, that's also true. I am where I am because I'm stubborn AF and just keep hacking on things until they work. Maybe one of the biggest differences is just ego, lol.
They are training them on decompilation and reverse engineering/blackbox reimplementations/pentesting because it’s one of the best ways to generate interesting and rare RL traces for agentic coding AND teach them how lots of things work under the hood.

Just throw Claude at millions of binaries and you can get amazing training data. Oh wait 4.7 gives you refusals for that now

This is a price discrimination/upsell strategy. Sure, if you just want software, use our public model. Don’t worry; it’s safe.

But if you want your model to be secure, and you want to deal with dangerous stuff, contact us for pricing. BTW if you don’t pay for us to pentest you, maybe someone else will, idk.

Oh also you’re not allowed to pentest yourself with our public models anymore because it looks like hacking

Yes, it's a losing strategy; no one else is going to do this. They are inviting parties to partner with them, so it's not totally in conflict, but yeah I'm sure there's genuine concern coming out of Anthropic, but I also think at this point they've likely culturally internalized "Dangerous [think: powerful] AI" as a brand narrative.

The "Beware of Mythos!" reads to me as standard Anthropic/Dario copy. Is it more true now than it was before? Sure. Is now the moment that the world's digital infrastructure succumbs to waves of hackers using countless exploits; I doubt it.

>Is now the moment that the world's digital infrastructure succumbs to waves of hackers using countless exploits; I doubt it.

I am not into cybersecurity but the existing "technical debt" in terms of security has been barely exploited.

The issue is that literally all software has some vulnerability, like it or not. And these LLMs are like brute forcing all possibilities faster than a human can. Sometimes humans even ignore low-severity security issues, while maybe these LLMs are capable of building exploits on top of multiple ones.

To me, they understood the moat: cybersecurity is such a trivial space to get into. I guess they are investing heavily in that because, as someone else mentioned in other threads, it's obvious they are too limited for other tasks.

Becoming a "mandatory" (SOC-2 etc, things like that) integrated part of your CI/CD pipeline would be a huge win for them. Imagine that.

This is the company that allowed a vibe-release resulting in the leaking of the entirety of the Claude Code codebase. What is the bar you're expecting here exactly?
I feel it’s fine as a short term solution, and probably a good thing. Gives the good guys some time to stay on top.

Always remember: a defender must succeed every time, an attacker only once.

Given the list of very large companies in the "glasswing" project, it is likely every competent state actor and criminal organization already has access to Mythos in one way or another. Meanwhile the open-source volunteers responsible for the security of the entire internet don't have access.
It's not an easy problem to solve. You can identify certain open source projects that you deem critical and give them access too in a private fashion (maybe even under NDA). Not every state actor will have early access; Russia and the Chinese surely won't, and that matters in current affairs. It's probably only the US government, not even European allies, who currently can use Mythos. The announcement specifically says "Anthropic has also been in ongoing discussions with US government officials about Claude Mythos Preview".

There is no good solution to this. Only less bad. It annoys me a bit that many comments on HN imply that open-sourcing everything right away is the answer to everything. To be clear, I'm not annoyed at your comment specifically, it's more an overall sentiment that I perceive here that I feel is very complacent. We've already seen how OSS maintainers get overwhelmed by AI vulnerability reports; I feel it's a responsible thing to gatekeep this for as long as possible (which really is only a few months, at most - other models catch up fast), and try to work with important maintainers directly to help fix the most critical stuff and onboard them to a new world of the AI-assisted cat-and-mouse security game.

This is just damage control. The damage, i.e. the attack capabilities opened up by this, is pretty brutal, and likely requires a substantial shift in mindset from OSS maintainers. This approach gives a few months of transition time. Who decides who is an important maintainer and who isn't? Again, super grey area; there's no time to decide on a proper process given how fast other models will catch up, so realistically you can just do a bit of a best effort here and try to not botch it up entirely. Anthropic went with the Linux Foundation here. It's a reasonable choice. Not a perfect one, but you gotta start somewhere.

So then why expect that you're making the world safer by limiting the capability that your vendor-locked customers have access to while attackers will go find the best de-censored model that works for them, wherever they can find it?
Yeah, it is easier to destroy than to create. Models will always be better at hacking than at building.
Curious how the safeguards work and what impact they will have.

In general I feel that over-engineering safeguards in training comes at a noticeable cost to general intelligence. Like asking someone to solve a problem on a whiteboard in a job interview. In that situation, the stress slices off at least 10% of my IQ.

While I believe that Mythos is better than the models we have right now, the "too dangerous to release" sounds largely like a marketing gimmick to me. Well, it's not for me to speculate; I simply need to wait for the huge wave of security patches to all software in the coming weeks, as per Anthropic's claims.
I'm not a security expert and don't know how to properly audit every GitHub repo that I come across. Maybe I sometimes want to build GNOME extensions or cool software projects from source and I want some level of checking along the way for known vulnerabilities. They can't claim this is an obvious win for security when it centralizes rather than democratizes security.
I interpreted their actions as providing time for vendors to protect themselves against the new model proactively, not to nerf the models themselves.

Although perhaps I am naive.