| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by TeMPOraL 40 days ago

> Because while humans invent cants/argots all the time to hide what they're talking about (Polari and rhyming slang being the most famous in recent history), agents are much more alike each other than like us even when they're different models, and identical when they're the same model.

Anthropic published a paper on Subliminal Learning nearly a year ago[0] - so at this point you should expect it being in the training corpus of current models. Definitely something that can be used as part of an attack, or worse, something the models themselves might walk into without realizing it.

Still, that's one of the many, many examples of channels available to agents both uniquely, and with prior art of being exploited by humans.

> Agents that do work with data should not have access to comms tools.

Another blind spot people have here, is to fixate on direct cause-and-effect and immediate timescales. A practical attack can involve a chain of several agents, executed over days or months, with some of the agents possibly being human; all it takes is for one agent to access something touched by other agent in the past, and a link is forged.

E.g. your data worker can get influenced by data to name output files in a particular way, and then a coding agent independently listing contents of that directory will pass a prompt injection to whatever agent that parses its logs, etc.

[0] - https://alignment.anthropic.com/2025/subliminal-learning/

1 comments

ben_w 39 days ago

> https://alignment.anthropic.com/2025/subliminal-learning/

Thanks, that's the research I was thinking about, but I couldn't recall the keyword to search for it.

link