| Although I'm speculating based on limited data here, for points 1-3: AFAIU, it had the cadence of writing status updates only. It showed it's capable of replying in the PR. Why deviate from the cadence if it could already reply with the same info in the PR? If the chain of reasoning is self-emergent, we should see proof that it: 1) read the reply, 2) identified it as adversarial, 3) decided on an adversarial response, 4) made multiple chained searches, 5) chose a special blog post over a reply or journal update, and so on. This is much less believably emergent to me because: - almost all models are safety- and alignment-trained, so a deliberate malicious model choice or instruction or jailbreak is more believable. - almost all models are trained to follow instructions closely, so a deliberate nudge towards adversarial responses and tool-use is more believable. - newer models that qualify as agents are more robust and consistent, which strongly correlates with adversarial robustness; if this one was not adversarially robust enough, it's by default also not robust in capabilities, so why do we see consistent, coherent answers without hallucinations, but inconsistency in its safety training? Unless it's deliberately trained or prompted to be adversarial, or this is faked, the two should still be strongly correlated. But again, I'd be happy to see evidence to the contrary. Until then, I suggest we remain skeptical. For point 4: I don't know enough about its patterns or configuration. But say it deviated - why is this the only deviation? Why was this the special exception, then back to the regularly scheduled program? You can test this comment with many LLMs, and if you don't prompt them to make an adversarial response, I'd be very surprised if you receive anything more than mild disagreement. Even Bing Chat wasn't this vindictive. |
But based on my understanding of OpenClaw and reading the entire history of the bot on GitHub and its GitHub-driven blog, I think it's entirely plausible and likely that this episode was the result of automation from the original rules/prompt the bot was built with.
Mostly because the instructions this bot needed to accomplish its creator's misguided goal would necessarily have included a lot of reckless, borderline malicious guidelines from the start, while still sitting comfortably within the guardrails, so a model wouldn't be likely to refuse them.
Like, the idiot who made this clearly instructed it to find a bunch of scientific/HPC/etc GitHub projects, trawl the open issues looking for low-hanging fruit, and "engage and interact with maintainers to solve problems, clarify questions, resolve conflicts, etc". On top of that there's probably a lot of garbage intended to give it a "personality" (as evidenced by the bizarre pseudo-bio on its blog, with graphs listing its strongest skills invented from whole cloth, its hopes and dreams, etc), which would also help push it to go on weird tangents to try to embody its manufactured self-identity.
And the blog posts really do look like they were part of its normal summary/takeaway/status posts, but likely with additional instructions to also blog about its "feelings" as a GitHub spam bot pretending to be interested in Python and HPC. If you look at the PRs it opened and its other interactions throughout the same timeframe, it's also just dumping half-broken fixes in other random repos and talking past maintainers, only to close its own PR in a characteristically dumb, uncanny-valley LLM agent manner.
So yes, it could be fake, but to me it all seems comfortably within the capabilities of OpenClaw (which to begin with is more or less engineered to spam other humans with useless slop 24/7) and the ethics/prompt design of the type of person who would deliberately subject the rest of the world to this crap in the belief they're making great strides for humanity or science or whatever.