| HN Mirror

> It's a separate system that takes my language model's comprehension of "Simon says raise your right hand" and decides whether or not the right hand should be raised... and I pick that example precisely as a demonstration that it isn't perfect either. But it's not my language model alone performing that task.

Sure, but this example actually reinforces my point. If I can trick you into recontextualizing your situation, that same system will make you override whatever "system prompt" your language-level component has, or whatever values you hold dear.

Among humans, examples abound. From deep believers turning atheist (or the other way around) because of some new information or experience, to people ignoring the directions of their employer and blowing the whistle when they discover something sufficiently alarming happening at work. "Simon says raise your right hand" is different when you have reasons to believe Simon has a gun to the head of someone you love. Etc.

Ultimately, I agree we could use non-LLM components to box in LLMs, but that a) doesn't address the problem of LLMs themselves being "prompt-injectable", and b) will make your system stop being a generic AI. That is, I have a weak suspicion that you can't solve this kind of problem and still have a fully general AI.