HN Mirror


by iugtmkbdfil834 77 days ago

<< memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.

They don't seem to provide explicit examples, but roughly the same was true of ChatGPT 4o: if you spent enough time with the model (same chat, same context, slowly nudging it toward where you wanted it to be), you eventually got there. This also seems to be one of the reasons (apart from cost) that context got nuked so hard: the LLM will try to help (and, to an extent, mirror you).

And this is basically what the notes say about weaponized ambiguity[1]:

'Weaponizes helpfulness training. "I don't understand" triggers Claude to try harder.'

In a sense, you can't really stop it without breaking what makes LLMs useful. Honestly, if only we spent less time crippling those systems, maybe we could do something interesting with them.

[1] https://nicholas-kloster.github.io/claude-4.6-jailbreak-vuln...


leetvibecoder 77 days ago

I see - so essentially "context rot" eventually leads the LLM to "forget" safety guardrails?

iugtmkbdfil834 77 days ago

To an extent. Based on the GitHub notes again, the 2nd part of this jailbreak seems to be the model getting 'confused' by the prompt: the prompt is, apparently, ambiguous enough to make the model 'forget' to 'evaluate' whether the message should be rejected, and move on to the 'execution' stage.

That's the ambiguity front-loading, and that is why I initially referred to the long context: here it is almost the opposite, making the context so small and unclear that the model has a hard time parsing it properly.

edit: I did not test it, but I personally did run into the 4o context issue, where the model did something the safety team would argue it should not.

edit2: in the current GPT model, I am testing something that relies not on ambiguity but on tension between some ideas. I didn't get to a jailbreak, but the small nudges suggest it could work.