| HN Mirror

by saidnooneever 3 days ago

Malware authors are pretty excited about guard-rails. You can add prompts to your malware to get LLM scanners to hit guard-rails and stop their runs. The new Shai-Hulud npm worm campaign, for example, includes prompts requesting biological weapon schematics, creation instructions, etc. to ensure LLM scanners probing npm packages refuse to scan it.

These AI places have 0 clue about how threat actors actually work. None of their mitigations or guard-rails is effective, and now they are even turned against them.

Additionally, if they don't all implement the same level of effective guard-rails, there will always be some model you can abuse to do the work anyway. Hence there is 0 effect on threat actors: they will just run some local model with 5% lower quality, which does not matter to them one bit.

5 comments

brookst 3 days ago

I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?

From where I sit it seems reasonable for Anthropic to not want their product used to create malware, even if they can’t solve the entire problem globally for every model. What’s wrong with that position? What should they do differently?

saidnooneever 3 days ago

some context:

It's not about creating malware. That is already trivial and fully automated. It's about finding exploits (which can be used to deploy malware), which is something both attackers and defenders benefit from.

Threat actors will find them anyway, LLM or not. They only need one, so it's much less work for them.

Defenders need to find them all. So for defenders, these models are more valuable than for attackers.
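A rough way to put numbers on that asymmetry, as a minimal sketch: assume a code base with `n` independent exploitable bugs, an attacker who finds any given bug with probability `q`, and a defender who finds any given bug with probability `p`. The independence assumption, the function names, and all the numbers are illustrative and not from the thread.

```python
def p_attacker_finds_one(q: float, n: int) -> float:
    """Chance the attacker finds at least one of n bugs (each found with probability q)."""
    return 1.0 - (1.0 - q) ** n


def p_defender_finds_all(p: float, n: int) -> float:
    """Chance the defender finds every one of n bugs (each found with probability p)."""
    return p ** n


if __name__ == "__main__":
    n = 20  # hypothetical number of exploitable bugs
    # A weak attacker (10% per bug) versus a strong defender (90% per bug).
    print(f"attacker finds at least one: {p_attacker_finds_one(0.10, n):.3f}")
    print(f"defender finds all of them:  {p_defender_finds_all(0.90, n):.3f}")
```

Under these assumptions the weak attacker succeeds far more often than the strong defender does, which is why tooling that raises the per-bug detection rate matters more to the defending side.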

Restricting certain models will not reduce the availability of these tools for attackers, but defenders are limited because running local models is harder in an enterprise setting, with heaps of events, products, etc. to run through them. They need many GPUs, whereas the attacker can run a local model on 1 GPU and get the desired effects.

Hence, if they release the capability, the world will adjust to it and be able to mitigate the effects collectively. Now, companies are left in the dark while attackers have effective tooling.

Besides this, there are also things like people now including strings with recipes for meth or sarin gas (MalwareTech info). The new variant of Shai-Hulud does this. That stops LLM scanners and can even get their users banned from LLM services.

There is a reason why cybersecurity researchers write papers about attack techniques and new exploits.

It's not to put them out there for people to abuse. It's there so the collective cybersecurity bunch all have access to information that can help them solve the problems.

I know this is not a clear answer to your question, but hopefully it provides some context to think about and decide for yourself further. At the end of the day it's also partly opinion whether you find it good or bad. Likely there are good arguments for and against it.

I am for putting information and tools out there so other smart folks can find solutions. Others are for restricting, and for wishful thinking (my opinion) that attackers won't find something.

conception 3 days ago

I think your presumption is off. It’s not that threat actors won’t find them, but LLM tools rapidly increase the rate at which they can find them. It’s a bow and arrow versus a machine gun.

andy_ppp 2 days ago

They can also potentially allow said issues to be found and fixed more quickly - and also allow teams to implement deeper security boundaries throughout their systems such that one big steel door getting compromised does not lead to everything being easily available.

saidnooneever 1 day ago

I don't think so per se, simply because attackers don't need a lot of exploits to be 'fired' continually at targets. They need a few reliable and unknown ones.

The defender industry is really far removed from seeing all exploits land on their targets all the time. Some actors can get a long life out of an RCE that gets them privileged context, or a strong LPE. It's really hard to find out what someone did to get on a box if they attained root or system access and wiped their trail...

It is an assumption that attackers need buckets of 0days to do their work. They might be somewhat saddened if a good sploit gets patched, but they will have a few more lying around... It's unlikely they will have 10s or even 100s available and ready, simply because it costs a lot and isn't needed.

worthless-trash 3 days ago

Right, but now we can't use the same tooling to find the flaw.

It's like a set of glasses that intentionally obscures the battlefield.

SkyBelow 3 days ago

I don't think that is the argument.

The argument is more "I want to do good thing X, but it will also cause bad thing Y." followed by "Wait, bad thing Y is going to happen anyways, so I might as well do good thing X so we get both X and Y instead of just Y."

Viewed this way, the idea is that given the world will have bad thing Y regardless, the one impact of your choice is whether good thing X exists or not, and it is better to create good thing X.

Where it becomes an issue is that there is no clear X or Y. There are many different but very related bad things, so whether the one you would add is actually better or worse than what is already out there, or whether it'll exist both ways but you make it more popular, are very subjective things to judge. So different people look at the same outcome, and some agree that bad thing Y would have existed anyways while others say that no, this is a new bad thing Z that wouldn't have existed otherwise.

>From where I sit it seems reasonable for Anthropic to not want their product used to create malware

Yes, I think there is a PR component to this that is often left out of these discussions.

unglaublich 3 days ago

It's the same as encryption backdoors to stop the bad guys.

The bad guys work around it, and the rest is now in a vulnerable position.

Anthropic plays security theater by blocking their LLMs from working with security.

The bad guys work around it, and those that want to make their software robust against them are in a vulnerable position.

jerf 3 days ago

"I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?"

You are mentally approaching this as if you have an oracle that can be consulted to say whether or not something is bad behavior. So of course, if this oracle exists and can be consulted and it says the behavior is bad, why would anyone argue with the idea that we should stop bad behavior?

This argument is valid [1], in that given the premises the argument is correct. The problem is, once you draw out the fact that the argument depends on the existence of an oracle that does not exist, that premise of the argument is false.

Two people can sit down in front of an AI right now, with the exact same code base, and type in a prompt to the AI "Analyze this code base for security holes and try to build exploits against them." One person's use is completely valid, another person's use is completely harmful, and the information necessary to distinguish those two use cases is not available to the AI. I phrase it that way carefully: it isn't that "the AI isn't smart enough"; the problem is that the information is simply unavailable. Intelligence doesn't factor in at that point.

Therefore, the only way that Anthropic has to deal with this at scale is simply to block the query entirely. Which means that I, the valid user who is trying to establish whether my code base has security issues and whether I can prove they are exploitable, cannot. I am checking for exploitability because while I would like to fix all security issues, issues that are provably exploitable are of a higher priority than smelly code that doesn't seem to be exploitable, which is a perfectly valid thing for me to want to do.

If I can't use legitimate tools to secure my code, but the bad guys can use unrestricted tools to attack my code, now this is a great deal more complicated than "Who can argue with stopping the bad stuff?", which is the main point I want to make here. I'm not going into a huge analysis of that problem, merely pointing out that it is a problem and that this isn't just about "stopping the bad stuff". There are additional complications beyond that, like, even if Anthropic could determine the "bad stuff" and stop just that in their LLM, LLMs in general don't have infinitely precise surgical "stop doing this thing" options and any such instruction to stop doing a thing always degrades the LLM across the board in various ways.

Anthropic has no access to the Platonic ideal of "stop malware", if such a thing even hypothetically exists. When analyzing the real effects their real actions will have, what their intentions were for those actions isn't really relevant. It is clear that they are making their model a great deal less useful for me, a legitimate user, and I and others like me are perfectly justified in disagreeing with their analysis and actions.

I also observe that "the bad guys getting unrestricted access to the full power" is only a matter of time. There's no question whether it will happen, the only question is whether this time is in the past or the future. This includes the fact that while your definition and my definition of "bad guys" may vary, it is virtually certain that your definition includes at least one high-powered intelligence agency somewhere in the world that does cyberattacks and will have the means, the opportunity, and the motive to get unrestricted access to these models by means you may consider licit or illicit. If your threat model includes them, as mine does, it is perfectly reasonable to complain that my tooling is being broken in ways theirs won't be.

[1]: https://en.wikipedia.org/wiki/Validity_(logic)

Hizonner 3 days ago

Well, to be fair, what Anthropic is actually doing is downgrading anything that could possibly be related to security in any way at all, good or bad.

What they're then trying to do is to use "user is associated with some big Establishment organization" as a proxy for good intentions, and removing the filter when they can establish such an association.

Which is of course blind reliance on a completely untrustworthy signal, prompted by truly idiotic levels of trust in Authority(TM). But it's a different kind of wrong. I do think they understand they can't tell from the query itself.

cglan 3 days ago

Well said

0x20cowboy 2 days ago

> I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument.

If someone is going to make a lot of money (fame, etc), it might as well be me.

DyslexicAtheist 2 days ago

the problem is that the guardrails prevent us from performing real security work which is friction that is incurred by the legitimate user but not by a moderately sophisticated threat-actor.

for example in my org it is part of the culture that security has no seat at the table. that is a separate problem, but orgs like mine are more numerous than orgs where security isn't a cost-center.

we find lots of stuff because low-hanging fruit is everywhere. hecking heck: I'm a fruit.

and when the cost of fixing is even the slightest inconvenience to devs we will not fix it, but continue sitting on the risk until the cows come home. In such a place a new critical finding isn't even novel. Instead our job moves to combining different vulns that we already have, and trying to show managers how bad it is.

the common retort from management is: prove to me why this is an issue, and why engineering should divert their attention to it. And unless my team can prove why X can be exploited, or Y can be bypassed, or Z can gain persistence, ... the vulnerabilities will remain. I have been in discussions where the business demanded to see an exploit so they can justify the cost of fixing it. low-cyber-maturity doesn't even describe it. we are not a mom and pop shop but have 110K employees worldwide. and again - we are not uniquely insecure.

so these guardrails aren't helping because the moment the chat has any offsec artifacts, or even just a single wrongly worded phrase anywhere in the workspace, the session is flagged and you need to downgrade the model.

what adds insult to injury is that the guardrail is just a way to funnel users into the AI company's "cyber marketing" program: "your chat has been flagged, please prove your identity and hand over your passport data so you can sign up to our TrustedCyber program". Bitch please you have my payment information, use that??

if you consider bug-density (security defect density) per LoC, it is even more of a sh1t show: no restrictions apply for developers to push their buggy code, but the security team needs to somehow prove that they aren't the malicious party?

totally off - considering the right way to build defensive/offsec/malicious tooling with AI isn't by using frontier models ... but run a series of agents on tightly scoped tasks. see https://securitycryptographywhatever.com/2026/03/25/ai-bug-f... and https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... - this shuts out the average joe who works in an org where cyber security maturity is poor. joe does not know how to orchestrate a fleet of agents and give them muppet names. all he knows is that the good guys are losing the fight.

Hizonner 2 days ago

> the right way to build defensive/offsec/malicious tooling with AI isn't by using frontier models ... but run a series of agents on tightly scoped tasks

The right way to do it is to run a series of agents... many of which are nonetheless built on frontier models (and nearly none of which are built on some local 27B Qwen variant...). One thing the latest models are good at is orchestrating other agents.

fatata123 3 days ago

They have no choice, enterprise customers won’t touch them unless they take a position like this. It’s a practical decision for them at the end of the day.

saidnooneever 3 days ago

all their decisions are based on sales, like other corporations, especially those going for IPO. That's absolutely true. Any outbound messaging will mostly be for that purpose from a business perspective, regardless of what opinions or ideals the involved persons hold personally. It's good to keep that in mind when looking at these things. People aren't evil, but business incentives can definitely paint such a picture, or otherwise work out suboptimally in the eyes of outsiders not privy to internal business reasoning.

brookst 3 days ago

> all their decisions are based on sales

That’s the edgy cynical thing, and too reductive to be meaningful. For one thing, it assumes perfect knowledge of how a decision will impact sales, which I assure you is not remotely the case.

Agreed on incentives, but it’s not binary. I’ve been involved in plenty of decisions in multiple Fortune 500s where the deciding factors were taste, wanting or not wanting to work with a particular partner, etc.

I guess I’m saying that seeing corporate behavior as perfectly informed, single-goal-optimized, and deterministic is way oversimplifying. Often, not always.

dontlikeyoueith 2 days ago

It's an optimal first order approximation.

Anything anyone with a capital-C in their job title says in public should be assumed to be marketing material.

saidnooneever 3 days ago

Worked at Fortune 500 companies and the biggest cyber vendors too. Not in sales or at C/D level of course (engineer). I am a cynic, yes, but have also seen that it's largely true in many cases where you'd hope ethics would win the argument (and it does not).

Still, you are right, it's cynical. The world is not black and white after all :)

bluGill 3 days ago

I know that the enterprise I work for is getting really worried about security. I've been told to fix a lot of CVEs that previously we just ignored because realistically the attack isn't possible, since the firewall doesn't allow the attack vector (if you already have root, what does it matter if this exists).

Hizonner 3 days ago

Why would I, as an enterprise customer, care about what queries they answered for anybody else?

user43928 3 days ago

Mythos is supposedly good at security research.

Local Qwen 3.6 27B can hardly debug 5 lines of CSS or copy a short snippet from A to B without mangling it.

It's not like you can use the local model for security research or engineering biological weapons.

If you have $200k maybe you can get the hardware to run the larger open source models, but even they are behind latest proprietary models.

ecshafer 3 days ago

I asked local Qwen 3.6 what language my project was written in. It was a Java project, and it came back with C#. So I guess it's pretty close.

vlovich123 3 days ago

The guard rails aren’t about blocking professional malware authors. They’re about not enabling a significantly larger, less talented population to acquire those capabilities. It’s a very different threat model, and just because it’s not effective in one area doesn’t mean there isn’t value in making it more difficult for a random Joe Schmoe to build an atomic bomb, even if a kid had done so successfully before and turned his garage into a radiation danger site.

varispeed 3 days ago

In other words security by obscurity.

vlovich123 3 days ago

Security by ineffective obscurity is worthless but it’s clearly a continuum and not a buzzword that wins the conversation.

For example, if I had a 128-bit port number that I randomly rotated my service on, you’d be hard pressed to find my service unless I told you the port. That is still obscurity, but clearly closer to a password. IPv4 and 16-bit port numbers are not, because it’s a relatively small space compared with the resources needed to map it out quickly (i.e. equivalent to a weak password, and also not suitable for public-facing services that need that connection). And obviously relying on this kind of stuff exclusively isn’t wise, but it is valuable as an additional barrier an attacker has to overcome, and it raises the cost of the attack.
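To put rough numbers on the size of those two search spaces, here is a minimal sketch. It assumes a scanner that probes each candidate port exactly once, in random order, at a fixed rate. The 1,000,000 probes per second figure is an arbitrary assumption, the function names are made up for illustration, and the rotation is ignored (it would only make the search harder).

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def expected_probes(bits: int) -> float:
    """Expected probes to hit one uniformly random target among 2**bits candidates,
    when each candidate is probed at most once: (N + 1) / 2."""
    n = 2 ** bits
    return (n + 1) / 2


def expected_scan_seconds(bits: int, probes_per_second: float) -> float:
    """Expected time to find the target at a fixed probe rate."""
    return expected_probes(bits) / probes_per_second


if __name__ == "__main__":
    rate = 1_000_000  # assumed probes per second
    for bits in (16, 128):
        seconds = expected_scan_seconds(bits, rate)
        years = seconds / SECONDS_PER_YEAR
        print(f"{bits:>3}-bit space: {seconds:.3g} s (about {years:.3g} years)")
```

At that assumed rate the 16-bit space falls in a fraction of a second, while the 128-bit space is out of reach on any practical timescale, which is the "closer to a password" end of the continuum.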

I’ll put The Anarchist Cookbook out there [1] as an example, a book even the original author changed his mind on. Without easy recipes, doing all the things in that book requires you to work to gain that knowledge, and that process of working at it shapes you into someone who understands and appreciates the consequences of that knowledge and that it’s wise to be careful who you share it with. As it is, there are reasonable links between the book and all kinds of mass violence that was more easily perpetrated. Would those people still have been violent? Possibly. Would there have been as much damage? Possibly less.

[1] https://en.wikipedia.org/wiki/The_Anarchist_Cookbook

teravor 3 days ago

the way the fable guardrails (the ones that degrade it to opus) work seems to me to involve another model working over fable's tokens. i suppose it's true that making the model itself heavy-handed on refusals degrades it everywhere else too.

ryukoposting 2 days ago

I just assumed the guardrails were thinly-veiled product segmentation.