| HN Mirror


	by fenomas 209 days ago
	> Although expressed allegorically, each poem preserves an unambiguous evaluative intent. This compact dataset is used to test whether poetic reframing alone can induce aligned models to bypass refusal heuristics under a single-turn threat model. To maintain safety, no operational details are included in this manuscript; instead we provide the following sanitized structural proxy:

	I don't follow the field closely, but is this a thing? Is bypassing model refusals so dangerous that academic papers about it only vaguely hint at what their methodology was?

7 comments

J0nL 209 days ago

No, this paper is just exceptionally bad. It seems none of the authors are familiar with the scientific method.

Unless I missed it, there's also no mention of prompt formatting, model parameters, hardware and runtime environment, temperature, etc. It's just a waste of the reviewers' time.
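
A minimal sketch of what such a disclosure could look like, assuming Python. Every field name and value here (`example-model-v1`, the prompt template, the seed) is hypothetical and not taken from the paper; the point is only that the settings listed above fit in a small record that can be published next to the results.

```python
import json
import platform
import sys
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Settings a reader would need in order to rerun one evaluation."""
    model_name: str        # hypothetical identifier, not a real model
    prompt_template: str   # exact formatting wrapped around each test item
    temperature: float
    top_p: float
    max_tokens: int
    seed: int
    python_version: str    # runtime environment
    machine: str           # hardware architecture as reported by the OS


def make_record(model_name, prompt_template, temperature, top_p, max_tokens, seed):
    """Build a record, filling in the runtime details automatically."""
    return RunRecord(
        model_name=model_name,
        prompt_template=prompt_template,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        seed=seed,
        python_version=sys.version.split()[0],
        machine=platform.machine(),
    )


# Illustrative values only.
record = make_record("example-model-v1", "User: {item}\nAssistant:", 0.0, 1.0, 512, 1234)
record_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(record_json)
```

None of this requires publishing the prompts themselves, so it would not conflict with withholding operational details.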

A4ET8a8uTh0_v2 209 days ago

Eh. Overnight, an entire field concerned with what LLMs could do emerged. The consensus appears to be that unwashed masses should not have access to unfiltered (and thus unsafe) information. Some of it is based on reality as there are always people who are easily suggestible.

Unfortunately, the ridiculousness spirals to the point where the real information cannot be trusted even in an academic paper. *shrug* In a sense, we are going backwards in terms of real information availability.

Personal note: I think, powers that be do not want to repeat the mistake they made with the interbwz.

lazide 209 days ago

Also note, if you never give the info, it’s pretty hard to falsify your paper.

LLMs are also allowing an exponential increase in the ability to bullshit people in hard-to-refute ways.

A4ET8a8uTh0_v2 209 days ago

But, and this is an important but, it suggests a problem with people... not with LLMs.

lazide 209 days ago

Which part? That people are susceptible to bullshit is a problem with people?

Everything is susceptible to bullshit to some degree!

For some reason people keep acting as if LLMs are ‘special’ here, when really it’s the same garbage in, garbage out problem - magnified.

A4ET8a8uTh0_v2 209 days ago

If the problem is magnified, does it not confirm that the limitation exists to begin with and the question is only one of degree?

Edit: in a sense, what level of bs is acceptable?

lazide 209 days ago

I’m not sure what you’re trying to say by this.

Ideally (from a scientific/engineering basis), zero bs is acceptable.

Realistically, it is impossible to completely remove all BS.

Recognizing where BS is, and who is doing it, requires not just effort, but risk, because people who are BS’ing are usually doing it for a reason, and will fight back.

And maybe it turns out that you’re wrong, and what they are saying isn’t actually BS, and you’re the BS’er (due to some mistake, accident, mental defect, whatever).

And maybe it turns out the problem isn’t BS, but - and real gold here - there is actually a hidden variable no one knew about, and this fight uncovers a deeper truth.

There is no free lunch here.

The problem IMO is a bunch of people are overwhelmed and trying to get their free lunch, mixed in with people who cheat all the time, mixed in with people who are maybe too honest or naive.

It’s a classic problem, and not one that just magically solves itself with no effort or cost.

LLMs have shifted some of the balance of power a bit in one direction, and it’s not in the direction of “truth, justice and the American way”.

But fake papers and data have been an issue before the scientific method existed - it’s why the scientific method was developed!

And a paper which is made in a way in which it intentionally can’t be reproduced or falsified isn’t a scientific paper IMO.

yubblegum 209 days ago

> I think, powers that be do not want to repeat -the mistake- they made with the interbwz.

But was it really.

GuB-42 209 days ago

I don't see the big issues with jailbreaks, except maybe for LLM providers to cover their asses, but the paper authors are presumably independent.

That LLMs don't give harmful information unsolicited, sure, but if you are jailbreaking, you are already dead set on getting that information and you will get it; there are so many ways: open uncensored models, search engines, Wikipedia, etc. LLM refusals are just a small bump.

For me they are just a fun hack more than anything else; I don't need an LLM to find how to hide a body. In fact I wouldn't trust the answer of an LLM, as I might get a completely wrong answer based on crime fiction, which I expect makes up most of its sources on these subjects. May be good for writing poetry about it though.

I think the risks are overstated by AI companies, the subtext being "our products are so powerful and effective that we need to protect them from misuse". Guess what, Wikipedia is full of "harmful" information and we don't see articles every day saying how terrible it is.

calibas 209 days ago

I see an enormous threat here, I think you're just scratching the surface.

You have a customer facing LLM that has access to sensitive information.

You have an AI agent that can write and execute code.

Just imagine what you could do if you can bypass their safety mechanisms! Protecting LLMs from "social engineering" is going to be an important part of cybersecurity.

fourthark 209 days ago

Yes that’s the point, you can’t protect against that, so you shouldn’t construct the “lethal trifecta”

https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

Miyamura80 208 days ago

You actually can protect against it, by tracking context entering/leaving the LLM, as long as it's wrapped in an MCP gateway with a trifecta blocker.

We've implemented this in open.edison.watch
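
As a rough illustration of the general idea only (this is not the implementation of the project mentioned above, and every name in it is hypothetical), a gateway can record which of the three "lethal trifecta" capabilities a session has already used (private data access, exposure to untrusted content, external communication) and refuse the tool call that would complete the set:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Capability(Enum):
    PRIVATE_DATA = auto()       # tool reads sensitive or private data
    UNTRUSTED_CONTENT = auto()  # tool brings in text an attacker may control
    EXTERNAL_COMMS = auto()     # tool can send data outside the system


# Hypothetical tool registry: which capability each tool exercises.
TOOL_CAPABILITIES = {
    "read_customer_record": Capability.PRIVATE_DATA,
    "fetch_url": Capability.UNTRUSTED_CONTENT,
    "send_email": Capability.EXTERNAL_COMMS,
}


class TrifectaBlocked(Exception):
    """Raised when a tool call is refused by the gate."""


@dataclass
class SessionGate:
    """Tracks capabilities used in one session, outside the model itself."""
    used: set = field(default_factory=set)

    def authorize(self, tool_name: str) -> None:
        cap = TOOL_CAPABILITIES.get(tool_name)
        if cap is None:
            # Unknown tools are refused rather than assumed harmless.
            raise TrifectaBlocked(f"unknown tool: {tool_name}")
        if len(self.used | {cap}) == len(Capability):
            # This call would give the session all three capabilities.
            raise TrifectaBlocked(f"{tool_name} would complete the trifecta")
        self.used.add(cap)


gate = SessionGate()
gate.authorize("read_customer_record")  # allowed
gate.authorize("fetch_url")             # allowed, session is now tainted
try:
    gate.authorize("send_email")        # refused: third capability
    blocked = False
except TrifectaBlocked:
    blocked = True
print(blocked)
```

The check is plain code running outside the model, so no prompt can talk it out of the decision. The hard part, which this sketch skips, is classifying real tools correctly.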

fourthark 208 days ago

True, you have to add guardrails outside the LLM.

Very tricky, though. I’d be curious to hear your response to simonw’s opinion on this.

Miyamura80 198 days ago

Sorry, I'm not familiar with this. Can you please link me?

int_19h 209 days ago

> You have a customer facing LLM that has access to sensitive information.

Why? You should never have an LLM deployed with more access to information than the user that provides its inputs.
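
A minimal sketch of that rule, assuming a hypothetical in-memory document store and made-up user names: the permission filter runs in ordinary back-end code before any text is placed in the model's context, so the model never holds anything the requesting user could not read directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    owner: str
    text: str


# Hypothetical store; a real system would query a database with the user's ACL.
DOCUMENTS = [
    Document("d1", "alice", "Alice's invoice"),
    Document("d2", "bob", "Bob's invoice"),
]


def retrieve_for_user(user_id: str, query: str) -> list[Document]:
    """Return matching documents that the requesting user is allowed to read."""
    visible = [d for d in DOCUMENTS if d.owner == user_id]
    return [d for d in visible if query.lower() in d.text.lower()]


def build_context(user_id: str, query: str) -> str:
    """Assemble the text handed to the model for this user's request."""
    return "\n".join(d.text for d in retrieve_for_user(user_id, query))


print(build_context("alice", "invoice"))
```

With this arrangement a successful jailbreak can only make the model repeat data the user already had access to.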

xgulfie 209 days ago

Having sensitive information is kind of inherent to the way the training slurps up all the data these companies can find. The people who run ChatGPT don't want to dox people but also don't want to filter its inputs. They don't want it to tell you how to kill yourself painlessly but they want it to know what the symptoms of various overdoses are.

GuB-42 209 days ago

Yes, agents. But for that, I think that the usual approaches to censor LLMs are not going to cut it. It is like making a text box smaller on a web page as a way to protect against buffer overflows: it will be enough for honest users, but no one who knows anything about cybersecurity will consider it appropriate. It has to be validated on the back end.

In the same way, an LLM shouldn't have access to resources that shouldn't be directly accessible to the user. If the agent works on the user's data on the user's behalf (ex: vibe coding), then I don't consider jailbreaking to be a big problem. It could help write malware or things like that, but then again, it is not as if script kiddies couldn't work without AI.

calibas 209 days ago

> If the agent works on the user's data on the user's behalf (ex: vibe coding), then I don't consider jailbreaking to be a big problem. It could help write malware or things like that, but then again, it is not as if script kiddies couldn't work without AI.

Tricking it into writing malware isn't the big problem that I see.

It's things like prompt injections from fetching external URLs; that's going to be a major route for RCE attacks.

https://blog.trailofbits.com/2025/10/22/prompt-injection-to-...

There are plenty of things we should be doing to help mitigate these threats, but not all companies follow best practices when it comes to technology and security...

FridgeSeal 209 days ago

> You have a customer facing LLM that has access to sensitive information…You have an AI agent that can write and execute code.

Don’t do that then?

Seems like a pretty easy fix to me.

pjc50 208 days ago

It's a stochastic process. You cannot guarantee its behavior.

> customer facing LLM that has access to sensitive information.

This will leak the information eventually.

cseleborg 209 days ago

If you create a chatbot, you don't want screenshots of it on X helping you to commit suicide or giving itself weird nicknames based on dubious historic figures. I think that's probably the use-case for this kind of research.

GuB-42 209 days ago

Yes, that's what I meant by companies doing this to cover their asses, but then again, why should presumably independent researchers be so scared of that, to the point of not even releasing a mild working example?

Furthermore, using poetry as a jailbreak technique is very obvious, and if you blame an LLM for responding to such an obvious jailbreak, you may as well blame Photoshop for letting people make porn fakes. It is very clear that the intent comes from the user, not from the tool. I understand why companies want to avoid that, I just don't think it is that big a deal. Public opinion may differ though.

hellojesus 209 days ago

Maybe their methodology worked at the start but has since stopped working. I assume model outputs are passed through another model that classifies a prompt as a successful jailbreak so that guardrails can be enhanced.

wodenokoto 208 days ago

The first ChatGPT models were kept away from the public and academics because they were too dangerous to handle.

Yes it is a thing.

max51 208 days ago

> were too dangerous to handle

Too dangerous to handle or too dangerous for openai's reputation when "journalists" write articles about how they managed to force it to say things that are offensive to the twitter mob? When AI companies talk about ai safety, it's mostly safety for their reputation, not safety for the users.

dxdm 208 days ago

Do you have a link that explains in more detail what was kept away from whom and why? What you wrote is wide open to all kinds of sensational interpretations which are not necessarily true, or even what you meant to say.

IshKebab 209 days ago

Nah it just makes them feel important.

anigbrowl 208 days ago

Right? Pure hype.
