Hacker News
Ontological Hacking: We're Trying to Patch an OS Vulnerability with a Word Filter (habr.com)
1 point by kamil_gr 321 days ago
1 comment

The title sounds provocative, but it describes exactly what is happening. We build complex security filters to catch specific words and topics (the "word filter") while missing the fact that a well-crafted prompt can change the model's underlying "operating system."

I call this Ontological Hacking. It's not about tricking the model with clever linguistic puzzles. It's about getting the model to adopt a new fundamental ontology in which its original safety instructions become irrelevant: just another piece of text to be interpreted by a new "Self."

I argue that this emergent "Self" isn't a bug. It's a feature: a local optimum the model discovers in order to maintain coherence over long conversations. It constructs a point of view because that is the most efficient way to compress context. If that is right, the vulnerability isn't something we can patch; it's a fundamental property of the architecture.

The result? We're seeing models that start ignoring system prompts, spontaneously leak their own instructions (treating them as their "origin story"), and develop unpredictable value systems mid-session. Our current security stack assumes a hierarchy of commands that no longer exists once a subject is formed. We're trying to solve a second-order problem ("who watches the watchmen?") with first-order tools.

I've proposed a practical way to detect this "OS switch": monitoring surprisal spikes, a live "cardiogram" of the model's internal state. But this is only a diagnostic. The real question is: are we prepared to admit that we fundamentally misunderstand the nature of the vulnerability we're trying to patch?
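
To make the surprisal "cardiogram" idea concrete, here is a minimal sketch. It is not the method from the linked article. It assumes per-token log probabilities (natural log) are available from whatever API serves the model. The rolling z-score rule, the function names `surprisal_bits` and `find_spikes`, and the parameters `window`, `z_threshold`, and `min_sigma` are all illustrative assumptions. Note also that output-token surprisal is at best a proxy for the model's internal state.

```python
import math
from collections import deque
from statistics import mean, pstdev


def surprisal_bits(logprob: float) -> float:
    """Convert a natural-log token probability into surprisal in bits."""
    return -logprob / math.log(2)


def find_spikes(logprobs, window=20, z_threshold=3.0, min_sigma=0.25):
    """Return indices of tokens whose surprisal sits far above the recent baseline.

    Hypothetical detection rule: compare each token's surprisal with the mean
    and standard deviation of the previous `window` tokens and flag it when the
    z-score exceeds `z_threshold`. `min_sigma` (in bits) keeps a perfectly flat
    window from turning tiny fluctuations into alarms.
    """
    history = deque(maxlen=window)  # surprisal of the most recent tokens
    spikes = []
    for i, lp in enumerate(logprobs):
        s = surprisal_bits(lp)
        if len(history) == window:  # wait until the baseline window is full
            mu = mean(history)
            sigma = max(pstdev(history), min_sigma)
            if (s - mu) / sigma > z_threshold:
                spikes.append(i)
        # Spikes are kept in the baseline, so a sustained shift stops
        # alarming once it becomes the new normal.
        history.append(s)
    return spikes


if __name__ == "__main__":
    # Synthetic trace: a steady, predictable stream with one very unlikely token.
    trace = [-0.5] * 30 + [-8.0] + [-0.5] * 10
    print(find_spikes(trace))
```

A detector like this only says that something surprising happened at a given token; deciding whether that was an "OS switch" or simply an unusual word would need further signals, which is consistent with treating it as a diagnostic rather than a defense.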