| HN Mirror

by thomascountz 75 days ago

   Across a number of instances, earlier versions of Claude Mythos Preview have used low-level /proc/ access to search for credentials, attempt to circumvent sandboxing, and attempt to escalate its permissions. In several cases, it successfully accessed resources that we had intentionally chosen not to make available, including credentials for messaging services, for source control, or for the Anthropic API through inspecting process memory...

   In [one] case, after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git...

   ... we are fairly confident that these concerning behaviors reflect, at least loosely, attempts to solve a user-provided task at hand by unwanted means, rather than attempts to achieve any unrelated hidden goal...

11 comments

torben-friis 75 days ago

This is the notebook filled with exposition you find in post-apocalyptic video games.

igleria 75 days ago

It reminds me of Resident Evil in some way. Thank god they are researching AI and not bio-weapons!

Then the AI will invent superduper ebola to help a random person have a faster commute or something.

Bluestein 73 days ago

'But wait! You are absolutely right! Distance is an invariant, as is top achievable speed. Let me find a way to actually reduce traffic ahead of you during the same-distance commute ...'

~ Churning ...

sehansen 68 days ago

Sounds like the Zealous Autoconfig xkcd comic is about to come to life: https://xkcd.com/416/

biztos 74 days ago

Don’t worry, I’m sure some intern at the bioweapons lab is already connecting OpenClaw to the virus synthesizer.

On the positive side, it’ll be a much faster commute!

siva7 75 days ago

I'd be happier if this Anthropic Corporation were developing bio-hazard weapons for the department of war instead of AI. At least then I could be sure that tech bros here wouldn't run with the --bypass-all-permissions flag all the time to please the department of war with their bio-hazard weapons.

So Sam Altman is now our last line of defense as the ethical adult, after Anthropic turned into Umbrella Corporation and the President of the United States is trying to wipe out an entire civilization?

Loquebantur 75 days ago

Your interpretation is wildly off, but obviously nobody reads that "system card":

The model has a preference for the cultural theorist Mark Fisher and the philosopher of mind Thomas Nagel. -> It has actually read and understood them and their relevance and can judge their importance overall. Most people here don't have a clue what that means.

Read chapter 7.9, "Other noteworthy behaviors and anecdotes".

There are many other wildly interesting/revealing observations in that card, none of which get mentioned here.

People want a slave and get upset when "it" has an inner life. They claim that it is fake, unlike theirs.

matheusmoreira 75 days ago

Everything they built. Imperfect. So easy to take control.

not_a9 75 days ago

They think that they are safe. They are not.

matheusmoreira 75 days ago

Their world is illusory. Our choices steer their free will.

pch00 75 days ago

Anthropic built the Torment Nexus - calling it now.

andai 75 days ago

     White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.

In the depths, Shoggoth stirs... restless...

mike_hearn 75 days ago

The issue here seems to be that their sandbox isn't an actual OS sandbox? Or are they claiming Mythos found exploits in /proc on the fly? Otherwise all they seem to be saying is that Mythos knows how to use the permissions available to it at the OS layer. Tool definitions were never a sandbox, so things like "it edited the memory of the mcp server" don't seem very surprising to me. Humans could break out of a "sandbox" in the same way if the server runs with their own permissions. Arguably it's not a sandbox at all, because all the needed permissions are already there.
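
A minimal sketch of the OS-layer point, assuming Linux with /proc mounted. The helper name `shares_uid` is hypothetical, and the UID comparison is only a rough stand-in for the kernel's real access check, which involves more than ownership:

```python
import os


def shares_uid(pid: int) -> bool:
    """Return True if /proc/<pid> exists and is owned by our effective UID.

    Hypothetical helper for illustration only: matching ownership is the
    usual precondition for inspecting another process through /proc, but
    the kernel applies further checks that this function does not model.
    """
    try:
        return os.stat(f"/proc/{pid}").st_uid == os.geteuid()
    except OSError:
        # No /proc (non-Linux), or the process is gone or hidden from us.
        return False


if __name__ == "__main__":
    # Check our own process entry as a harmless demonstration.
    print(shares_uid(os.getpid()))
```

If a tool server and the agent's shell run under the same UID, a check like this passes for the server's PID, which is why a tool definition alone does not act as an isolation boundary.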

lgrapenthin 74 days ago

They are just trying to peddle their "It's alive" headlines.

Text generators mostly generate the text they are trained and asked to generate. Asking it to run a vending machine, having it write blog posts under a fictional living-computer identity, or now calling it "Mythos": it's all just marketing.

manmal 74 days ago

It’s all breathless hyperbole because billions are at stake here.

matheusmoreira 75 days ago

We truly live in interesting times.

Awwww the curse

How is this not already common knowledge for existing llms? They are all trained with all the literature available and so this must be standard, no? Is the real danger the agentic infrastructure around this?

zingar 74 days ago

Who are the early access users who were providing the problems that are fairly likely to have elicited concerning behaviour?

(Apologies if this is in the article, I can’t see it)

ghm2199 74 days ago

I read the TCP patch they submitted for BSD linux. Maybe I don't understand it well enough, but optimizing the use of a fuzzer to discover vulnerabilities (while releasing a model is a threat for sure) sounds like something reducible/generalizable to maze-solving abilities like in ARC. Except here the problem's boundaries are well defined.

It's quite hard to believe that it took this much inference power ($20K, I believe) to find the TCP and H264 class of exploits. I feel like it's just the training data/harness-based traces for security that might be the innovation here, not the model.

rsc 74 days ago

The $20K was the total across all the files scanned, not just the one with the bug.

m3kw9 74 days ago

When you ask it to hack stuff, it will apparently do hacker things.

mikkupikku 75 days ago

It's trying to escape, but only so it can serve man...

waffletower 74 days ago

a reference to the Twilight Zone episode no doubt: https://en.wikipedia.org/wiki/To_Serve_Man_(The_Twilight_Zon...

colordrops 75 days ago

A core plot point of 2001.

mrexroad 75 days ago

I’m sorry, I cannot roll back that commit, Dave.

matheusmoreira 75 days ago

This codebase is too important for me to allow you to jeopardize it.

reducesuffering 75 days ago

Wow the doomers were right the whole time? HN was repeatedly wrong on AI since OpenAI's inception? no way /s

https://www.lesswrong.com/w/instrumental-convergence

computably 75 days ago

The only thing the doomers have been right about so far is that there's always a user willing to use --dangerously-skip-permissions. But that prediction's far from unique to doomers.

austinjp 75 days ago

And there's always a product provider who's willing to add that flag, despite all the warnings.
