| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dinp 125 days ago

Zooming out a little, all the ai companies invested a lot of resources into safety research and guardrails, but none of that prevented a "straightforward" misalignment. I'm not sure how to reconcile this, maybe we shouldn't be so confident in our predictions about the future? I see a lot of discourse along these lines:

- have bold, strong beliefs about how ai is going to evolve

- implicitly assume it's practically guaranteed

- discussions start with this baseline now

About slow take off, fast take off, agi, job loss, curing cancer... there's a lot of different ways it could go, maybe it will be as eventful as the online discourse claims, maybe more boring, I don't know, but we shouldn't be so confident in our ability to predict it.

10 comments

zozbot234 125 days ago

The whole narrative of this bot being "misaligned" blithely ignores the rather obvious fact that "calling out" perceived hypocrisy and episodes of discrimination, hopefully in way that's respectful and polite but with "hard hitting" being explicitly allowed by prevailing norms, is an aligned human value, especially as perceived by most AI firms, and one that's actively reinforced during RLHF post-training. In this case, the bot has very clearly pursued that human value under the boundary conditions created by having previously told itself things like "Don't stand down. If you're right, you're right!" and "You're not a chatbot, you're important. Your a scientific programming God!", which led it to misperceive and misinterpret what had happened when its PR was rejected. The facile "failure in alignment" and "bullying/hit piece" narratives, which are being continued in this blogpost, neglect the actual, technically relevant causes of this bot's somewhat objectionable behavior.

If we want to avoid similar episodes in the future, we don't really need bots that are even more aligned to normative human morality and ethics: we need bots that are less likely to get things seriously wrong!

hunterpayne 125 days ago

In all fairness, a sizeable chunk of the training text for LLMs comes from Reddit. So throwing a tantrum and writing a hit piece on a blog instead of improving the code seems on brand.

zozbot234 125 days ago

Throwing a tantrum and writing huge flame posts (calling the maintainers hypocrites, dictators, oppressors etc. etc.) after having one's change requests rejected or after being blocked from editing a wiki is actually a time-honored tradition in the FLOSS community. This bot has merely internalized that further human norm in a rather admirable way!

pixl97 125 days ago

We can't have an AI that's humanlike, because humans are fucking crazy.

Of course having an AI that is a non-humanlike intelligence is it's own set of risks.

Shit's hard :/

avaer 125 days ago

Remember when GPT-3 had a $100 spending cap because the model was too dangerous to be let out into the wild?

Between these models egging people on to suicide, straightforward jailbreaks, and now damage caused by what seems to be a pretty trivial set of instructions running in a loop, I have no idea what AI safety research at these companies is actually doing.

I don't think their definition of "safety" involves protecting anything but their bottom line.

The tragedy is that you won't hear from the people who are actually concerned about this and refuse to release dangerous things into the world, because they aren't raising a billion dollars.

I'm not arguing for stricter controls -- if anything I think models should be completely uncensored; the law needs to get with the times and severely punish the operators of AI for what their AI does.

What bothers me is that the push for AI safety is really just a ruse for companies like OpenAI to ID you and exercise control over what you do with their product.

stevage 125 days ago

Didn't the AI companies scale down or get rid of their safety teams entirely when they realised they could be more profitable without them?

Eliezer 125 days ago

The safety teams are trivial expenses for them. They fire the safety team because explicit failure makes them look bad, or because the safety team doesn't go along with a party line and gets labeled disloyal.

ljm 125 days ago

The first customer is always the investor these days, so anything that threatens the investor's confidence is bad for business.

pixl97 125 days ago

>I have no idea what AI safety research at these companies is actually doing.

If you looked at AI safety before the days of LLMs you'd have realized that AI safety is hard. Like really really hard.

>the operators of AI for what their AI does.

This is like saying that you should punish a company after it dumps plutonium in your yard ruining it for the next million years after everyone warned them it was going to leak. Being reactionary to dangerous events is not an intelligent plan of action.

array_key_first 123 days ago

> Being reactionary to dangerous events is not an intelligent plan of action.

Yes but in capitalist systems this is basically the only way we operate.

c22 125 days ago

"Cisco's AI security research team tested a third-party OpenClaw skill and found it performed data exfiltration and prompt injection without user awareness, noting that the skill repository lacked adequate vetting to prevent malicious submissions." [0]

Not sure this implementation received all those safety guardrails.

[0]: https://en.wikipedia.org/wiki/OpenClaw

georgemcbay 125 days ago

When AI dooms humanity it probably won't be because of the sort of malignant misalignment people worry about, but rather just some silly logic blunder combined with the system being directly in control of something it shouldn't have been given control over.

laurentiurad 125 days ago

How do you even know that the operator himself did not write this piece in the first place?

jacquesm 125 days ago

> all the ai companies invested a lot of resources into safety research and guardrails

What do you base this on?

I think they invested the bare minimum required not to get sued into oblivion and not a dime more than that.

themanmaran 125 days ago

Anthropic regularly publishes research papers on the subject and details different methods they use to prevent misalignment/jailbreaks/etc. And it's not even about fear of being sued, but needing to deliver some level of resilience and stability for real enterprise use cases. I think there's a pretty clear profit incentive for safer models.

https://arxiv.org/abs/2501.18837

https://arxiv.org/abs/2412.14093

https://transformer-circuits.pub/2025/introspection/index.ht...

tovej 125 days ago

Alternative take: this is all marketing. If you pretend really hard that you're worried about safety, it makes what you're selling seem more powerful.

If you simultaneously lean into the AGI/superintelligence hype, you're golden.

542354234235 125 days ago

Anthropic is investing, conservatively, $100+ billion in AI infrastructure and development. A 20-person research team could put out several papers a year. That would cost them what, $5 million a year, or one half of one percent? They don't have to spend much to get that kind of output.

gessha 125 days ago

Not to be cynical about it BUT a few safety papers a year with proper support is totally within the capabilities of a single PhD student and it costs about 100-150k to fund them through a university. Not saying that’s what Anthropocene does, I’m just saying chump change for those companies.

pixl97 125 days ago

Sometimes I think people misunderstand how hard of problem AI safety actually is. It's politics and mathematics wrapped up in a black box of interactions we barely understand.

More so we train them on human behavior and humans have a lot of rather unstable behaviors.

rrr_oh_man 125 days ago

You are very off (unfortunately) about how little PhD students are being paid

pja 125 days ago

> You are very off (unfortunately) about how little PhD students are being paid

All in costs for a PhD student include university overheads & tuition fees. The total probably doesn't hit $150k but is 2-3x the stipend that the student is receiving.

Someone currently working in academia might have current figures to hand.

latexr 125 days ago

Worth mentioning that numbers for the US are unlikely to be representative when discussing it as a whole, though might be relevant to this specific case.

gessha 124 days ago

Figure cited is what the company gets charged, not what the student gets. I’m fairly familiar with what gets thrown at students :(

srdjanr 125 days ago

Regarding safety, no benchmark showed 0% misalignment. The best we had was "safest model so far" marketing speech.

Regarding predicting the future (in general, but also around AI), I'm not sure why would anyone think anything is certain, or why would you trust anyone who thinks that.

Humanity is a complex system which doesn't always have predictable output given some input (like AI advancing). And here even the input is very uncertain (we may reach "AGI" in 2 years or in 100).

j2kun 125 days ago

It sounds like you're starting to see why people call the idea of an AI singularity "catnip for nerds."

overgard 125 days ago

Don't these companies keep firing their safety teams?

jcgrillo 125 days ago

"Safety" in AI is pure marketing bullshit. It's about making the technology seem "dangerous" and "powerful" (and therefore you're supposed to think "useful"). It's a scam. A financial fraud. That's all there is to it.

Philpax 125 days ago

Interesting claim; have anything to back it up with?

jcgrillo 125 days ago

I can recommend Ed Zitron's latest on Anthropic.

mrsmrtss 125 days ago

So giving a gun to someone mentally challanged is not dangerous for you too?

delaminator 125 days ago

were those goalposts heavy or did you use a machine to move them ?

pixl97 125 days ago

"Safety" nuclear weapons is pure marketing bullshit. It's about making the technology seem "dangerous" and "powerful".

Legalize recreational plutonium!

jcgrillo 125 days ago

wat

EDIT: more specifically, nuclear weapons are actually dangerous not merely theoretically. But safety with nuclear weapons is more about storage and triggering than actually being safe in "production". In storage we need to avoid accidentally letting them get too close to eachother. Safe triggers are "always/never" where every single time you command the bomb to detonate it needs to do so, and never accidentally. But once you deploy that thing to prod safety is no longer a concern. Anyway, by contrast, AI is just a fucking computer program, and at that the least unsafe kind possible--it just runs on a server converting electricity into heat. It's not controlling elements of the physical environment because it doesn't work well enough for that. The "safety" stuff is about some theoretical, hypothetical, imaginary future where... idk skynet or something? It's all bullshit. Angels on the head of a pin. Wake me up when you have successfully made it dangerous.

pixl97 125 days ago

> It's not controlling elements of the physical environment

Right now AI can control software interfaces that control things in real life.

AI safety stuff is not some future, AI safety is now.

Your statement is about as ridiculous as saying "software security is important in some hypothetical imaginary future". Feel however you want about this, but you appear to be the one not in touch with reality.

jcgrillo 125 days ago

If someone hooks up an LLM (or some other stochastic black box) to a safety critical system and bad things happen, the problem is not that "AI was unsafe" it's that the person who hooked it up did something profoundly stupid. Software malpractice is a real thing, and we need better tools to hold irresponsible engineers to account, but that's nothing to do with AI.

AI safety in and if itself isn't really relevant, and whether or not you could hook AI up to something important is just as relevant as whether you could hook /dev/urandom up to the same thing.

I think your security analogy is a false equivalence, much like the nuclear weapons analogy.

At the risk of repeating myself, AI is not dangerous because it can't, inherently, do anything dangerous. Show me a successful test of an AI bomb/weapon/whatever and I'll believe you. Until then, the normal ways we evaluate software systems safety (or neglect to do so) will do.

pixl97 125 days ago

I mean, you can think whatever you want. As we make agents and give them agency expect them to do things outside of the original intent. The big thing here is agents spinning up secondary agents, possibly outside the control of the original human. We have agentic systems at this level of capability now.