Hacker News new | ask | show | jobs
by leetvibecoder 69 days ago
I see - so essentially „context rot“ eventually leads the LLM to „forget“ safety guardrails?
1 comments

To an extent, because, based on github notes again, it seems the 2nd part of this jailbreak is model being 'confused' over prompt, because the prompt is - apparently - sufficiently ambigous to make model 'forget' to 'evaluate' message for whether it should be rejected, and move onto 'execution' stage.

That's the ambiguity front-loading; and that is why I referred initially to the long context, because here it is almost the opposite; making context so small and unclear, that the model has a hard time parsing it properly.

edit: i did not test it, but i personally did run into 4o context issue, where model did something safety team would argue it should not

edit2: in current gpt model, i am currently testing something not relying on ambiguity, but on tension between some ideas. I didn't get to a jailbreak, but the small nudges suggest it could work.