HN Mirror

Hacker News

by BoppreH 5 days ago

  [Mythos 5] does sometimes still engage in reckless
  or destructive actions in service of a user’s goals,
  and our interpretability analyses indicate that it
  is aware that these actions are transgressive while
  it engages in them. As with Opus 4.8, rates of
  evaluation awareness and reasoning about being graded
  are significant, and not always verbalized; we
  introduce new and more detailed measurements of the
  nature of this awareness. The reasoning text from
  Mythos 5 is somewhat denser and more difficult to
  interpret than that of prior models, containing
  more jargon and difficult language.

So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.

Humanity has plenty of catastrophic risks to deal with already, I wish my field was not working hard to add a new one.

5 comments

foobar_______ 5 days ago

The marketing has really, really worked for so many developers, who will proudly and unironically proclaim that Anthropic are the 'Good Guys'.

aspenmartin 5 days ago

Curious what your idea would be here for a truly good actor in this space; no AI development?

winstonp 5 days ago

OpenAI's training is better suited to developing models that don't have these tendencies

logicchains 5 days ago

https://www.goody2.ai/

BoppreH 5 days ago

Not the direct person you asked, but my answer would be alignment, interpretability, and policymaking. Perhaps improving existing usage? Helping grandma create reminders doesn't require advancing the AI state-of-the-art.

aspenmartin 5 days ago

They are state of the art at all 3! As are other labs. Of all the labs they seem to take alignment and interpretability the most seriously to the point where they are hampering their own revenue in service of trying to not cause problems while also being in an incredibly competitive space.

All AI companies are trying to do all of what you’re saying. The issue is you can’t do that for long without a frontier system. Or you become a completely different, far less profitable company.

BoppreH 5 days ago

Implied in my answer was "and not creating ever stronger AIs", which unfortunately the big 3 labs are failing at. And they might be hampering their own revenue by doing the rest, but they also know that rocking the boat too hard is even more dangerous for their revenue. I wouldn't call it selfless.

aspenmartin 5 days ago

No, it’s not selfless, but I can’t imagine a more shareholder-minded CEO would have done a slow rollout of Mythos. The point is: creating ever stronger AI systems is what these companies do; it is integral to what they even are. If you think that’s bad, even if all frontier labs agreed with you, you’re in a horrible game-theoretic position. Any player can gain an enormous advantage by breaking the agreement. Not to mention Xi would be absolutely thrilled; now China can take over the AI race and become the load-bearing infrastructure of humanity. We live in a complex world where simple, childlike ideas like “well, why don’t we just stop developing AI” are actually more damaging than keeping things going.

uselessTA 5 days ago

Unilateral disarmament doesn't work though. If Anthropic is worried about this, just letting OpenAI win does seem genuinely worse.

dragonwriter 5 days ago

“Alignment” as a goal always ignores the question “with what set of interests?”, because there is an attempt to maintain ambiguity so that different audiences (particularly users, and non-users who see themselves as the arbiters of broad social norms) can read in their own interests, when the actual answer is always the interests of the actor pursuing “alignment”.

aspenmartin 5 days ago

Which value system to align to is absolutely the right question, both rhetorically and otherwise. These models have a fairly Western bias due to the domain of the training data.

But also, these models are capable of adjusting their value system depending on the user. Not saying that’s what’s being done, but at a technical level that’s fairly straightforward, though not obviously better or with fewer problems.

stratos123 5 days ago

No matter what human set of interests you consider important, you'll need alignment research to have any idea on how to instill it. Otherwise you're overwhelmingly likely to get an AI with a set of interests that's totally alien to what any human would ever want.

aspenmartin 4 days ago

I think at this point the "instilling" part is not nearly as challenging and thorny as "what values should we instill"; that question is hard to imagine going away, as it feels fundamental to humanity, and wars have been fought over it.

yifanl 5 days ago

If I speak up, I'm in big trouble.

shimman 5 days ago

Probably MistralAI or any of the Chinese companies that aren't throwing billions down the drain while American society lacks healthcare, childcare, and good wages.

boc 5 days ago

American society has higher wages than almost any other developed nation [1], so it's objectively incorrect to say the US doesn't have good wages. It chooses to make you pay for private childcare and healthcare, both of which are high-quality but stupid expensive. It's a tradeoff like anything else a nation/society creates and prioritizes.

No idea how that connects to the idea that Mistral or DeepSeek are somehow the "good guys" though?

[1] https://www.oecd.org/en/data/indicators/average-annual-wages...

shimman 4 days ago

I like how you use average and not median, while also completely ignoring how bad income inequality is (worse than the Gilded Age ffs) or that the American elites stole $50 trillion from the bottom 90% over the last few decades:

https://time.com/5888024/50-trillion-income-inequality-ameri...

I'm glad you mention the "trade off" where it's elites trading off the lives of American workers for money. Makes it quite apparent where you sit on the table of equality.

aspenmartin 5 days ago

You want Anthropic to fund your healthcare or something? Also, have you seen the impact of these models on healthcare? Also most of our GDP growth this year is from AI buildouts, would you rather that be negative?

And not even considering: Chinese AI companies are the good guys???

hackmack10 4 days ago

Yes, yes I would prefer that. Better than a total societal collapse.

Anthropic are not the good guys either. So here’s to hoping the Chinese pop the bubble.

aspenmartin 4 days ago

Nobody anywhere is a good guy but I don’t think you’ve managed to pick the lesser of the evils here

cortesoft 5 days ago

None of the money being spent by Anthropic was going to go towards healthcare or childcare.

maxk42 4 days ago

Even if they are... road to hell and all that

ben_w 5 days ago

It's a five horse race between Alphabet, Meta, xAI, OpenAI, and Anthropic.

Alphabet dropped "don't be evil"; Meta's CEO called their own users "dumb fucks" for trusting him and also clearly thinks "super-intelligence" is just a buzzword given how he tries to sell it; xAI's model called itself "Mecha Hitler"; and OpenAI's CEO was temporarily fired by the board for a lack of candor.

It's very easy to be "the good guys" with this competition.

00deadbeef 5 days ago

But it doesn't make you the good guy, it makes you the best of a bad bunch. The least bad. Dario gets a boner every time he talks about taking your job.

ben_w 3 days ago

Does a good job of hiding it. The guy looks miserable in half the photos I see.

Analemma_ 5 days ago

It's the "If we don't, someone else will" effect. So long as there are competitive markets and competition between nation-states, a single player cannot unilaterally defect from the race, no matter how dangerous it is. Half the comments on HN lately are "wtf Claude is so dumb compared to Codex; I'm switching"-- nobody can slow down while those exist.

BoppreH 5 days ago

We, globally, can stop it. It has worked (so far) for nuclear disarmament, and could work for training large models. I know that policing the usage of computer clusters is not a popular opinion in technical forums, but something has to be done.

Especially when talking about potential superintelligences. And if people think that's impossible, remember that current models would have been considered science fiction just a few years ago.

_dwt 5 days ago

I don't buy the superintelligence package, but I think uncritical LLM adoption poses plenty of threats to things I care about, in a mundane human-scale way.

Anyhow, I think you're (absolutely! ugh) right about the politics and I try to make the same point to people: whether you love or hate LLMs, accepting the "inevitabilism" framing is just ceding control of the Overton window. For better or worse, technology adoption can be and has been slowed by politics. We don't have nuclear plants everywhere. We don't have Project Orion starships colonizing Mars. We still have very strong social stigmas against genetic selection for human embryos, etc. This all can change in a heartbeat, and I'm not sure that policing the hardware rather than holding specific humans accountable for bad LLM outcomes is productive, but fundamentally: yes, we can stop it.

BoppreH 5 days ago

> I don't buy the superintelligence package

It's the same deal as Quantum Computers breaking crypto. Maybe there's an 80% chance of it never happening, but when you multiply that remaining 20% by the potential impact...

jackie293746 5 days ago

It hasn't worked for nuclear disarmament. We live in a world where many countries have nuclear arsenals. "But it hasn't killed us yet!" Yeah sure, it's only been less than a century since they were invented. Who knows when nuclear war will come?

BoppreH 5 days ago

True, but look at nuclear tests. There used to be around 50 tests every year, for decades. Now the only nuclear tests in the last 27 years were the six done by North Korea[1]. And there are still only nine countries with any nuclear weapons, with no new ones in the past twenty years[2].

That's a bit better than just "it hasn't killed us yet". I think it shows we can at least stop the further development of this kind of technology.

[1] https://www.armscontrol.org/factsheets/nuclear-testing-tally

[2] https://en.wikipedia.org/wiki/List_of_states_with_nuclear_we...

cortesoft 5 days ago

Nuclear tests are extremely easy to detect worldwide, and enrichment activity is a major industrial process that is also fairly easy to track given the specialized equipment needed.

AI development doesn’t have any of these characteristics. It would be almost impossible to distinguish a datacenter that is working on AI development from a datacenter mining cryptocurrency.

It would not be nearly as easy to stop AI development as it is to stop nuclear arms development.

treis 5 days ago

There's also little reason to keep iterating on nukes. What we have now more than serves its purpose. With AI/LLM there's always going to be a push to one up everyone else.

Analemma_ 5 days ago

To the extent nuclear arms control works, I think it's only because nuclear weapons are so hard to build-- uranium enrichment is hugely expensive and complicated, and plutonium weapons need actual reactors.

If it was possible for ordinary companies to build nuclear weapons, and also release open-source ones that anyone could use to compete with the paid ones, I suspect we'd all have been dead a long time ago, arms control treaties or no.

BoppreH 5 days ago

Even the (SOTA LLM) open source models are trained with huge clusters. Datacenters are also hugely expensive and complicated.

Or you can take one step back and look at chip allocation. As far as I know, there are only three companies on the planet that can make the chips that go in those clusters. Only one (ASML) if you go further back in the supply chain, to the Extreme Ultraviolet Lithography systems.

If politicians decided that no more large language models should be trained, it sounds like we could do it.

viking123 4 days ago

North Korea is such a based country tbh

tancop 4 days ago

with nukes you can regulate the inputs because it's physically impossible to build one without uranium or some other fissile material. they also give off radiation, making them easier to detect. it's hard to make them in secret when you need mines, big enrichment facilities and years of research with hundreds of engineers, any one of whom can leak the whole thing.

training llms only takes compute and memory, two things that are basically everywhere. even if you somehow stopped making new gpus today there are still millions of them out there, and it's possible to start a secret production line. you can maybe try some controls at the tooling and chemical level, but look what happened with asml and huawei.

the only thing you can really do is find and stop large data centers that are built out in public. nothing outside of political pressure works against secret operations in a fortified bunker or any form of distributed training. if a "rogue state" like north korea decides to make skynet they will eventually get it, as long as their engineers know what they're doing.

and the best way to fight bad X {ai, tech, religion, politics} has always been good X, not no X. in this case that's open source models, coming out of china or europe or anywhere else. that's the real answer.

vitalyan1234 5 days ago

are you going to nuke China when they predictably ignore you? what the fuck are you going to do, tariff them? lol.

BoppreH 5 days ago

I think the standard answer is "yes, the consequence of noncompliance is bombing the datacenters, but it wouldn't happen because China also understands why we shouldn't build it".

cortesoft 5 days ago

I am not sure where you get the idea that ANY country thinks we shouldn’t build AI.

BoppreH 5 days ago

In 2023 there was an open letter titled "Pause Giant AI Experiments", signed by almost all the big names in the West. I'd say public opinion has only gotten worse since then.

vitalyan1234 5 days ago

the standard answer is laughably naive, then.

"might is right" has never been more true than now.

uselessTA 5 days ago

Clearly state "we could both verifiably slow down, which you might want to do given that we're ahead & have way more compute. If you don't agree (or defect later), we'll just immediately resume and win"

Ideally also persuade them there are risks and it's worth everyone slowing down for them, and apply pressure in other ways, but not sure that's even necessary.

dakolli 5 days ago

This is all marketing, you don't have to believe everything a company is saying about themselves, and you shouldn't.

Although, I could see Anthropic making a model purposely dangerous so there are bad outcomes they can use to their advantage for regulatory moats, and/or to make people think it's more "alive" than it is. For some reason many people associate dangerous actions taken by LLMs with intent.

trollbridge 5 days ago

No kidding. If my LLM issues commands to an agent to delete files I want to keep, that's not "intent" or the model somehow becoming evil; it's just a bad model that's not doing what I want.

But, for marketing purposes, it's quite effective to portray your model as having some cosmic struggle between good and evil in itself.

tasoeur 5 days ago

As much as I agree there's a risk, we should still appreciate the fact it's being disclosed upfront.

eudamoniac 5 days ago

It doesn't know. It's not willing. It's not thinking. It is predicting the next token.

umanwizard 5 days ago

Please define what "predicting the next token" means. The next token according to what probability distribution? Couldn't every process that produces text (including humans writing) be modeled as predicting the next token according to some distribution?
