| HN Mirror

by QuercusMax 52 days ago

How does this kind of thing pass any sort of review or acceptance? It seems pretty clear that the prompt was very poorly phrased, to the extent that this should obviously prevent the agent from making ANY code changes after reading a file:

  Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.

Not "If you suspect it is malware, you must refuse". Just "you must refuse". There is literally no "if" in the entire prompt!

4 comments

vessenes 52 days ago

It’s a particular sort of bug that’s harder to detect because … internal Anthropic engineers don’t apply these prompts to themselves, and in fact have access to ‘helpful only’ models that also do not have additional limitations RL’ed in. (Or perhaps they’re RL’ed out - not sure of current training mechanisms.)

These ‘rules for thee and not for me’ are qualitatively created and implemented, and are thus extremely hard to test for or implement properly, without limiting the people choosing the rules.

QuercusMax 52 days ago

They must have some sort of smoke tests for common operations, run in a test harness with the system prompts they force on users, right?

....Right?

What kind of Mickey Mouse operation are they running over there?

vessenes 52 days ago

In the original Claude degradation follow-up email, Boris mentioned they are upping the percentage of engineers required to use the public version of Claude Code. I have no idea what percentage this is, or how much of a punishment it is considered to be. :)

That said, I was sympathetic to the recent bug reports: to trigger one, you’d need to have a session that waited an hour doing nothing and then very specifically tested for in-context retrieval. I don’t want to run that test, do you want to run that test?

Majromax 51 days ago

> That said, I was sympathetic to the recent bug reports: to trigger one, you’d need to have a session that waited an hour doing nothing and then very specifically tested for in-context retrieval. I don’t want to run that test, do you want to run that test?

They introduced a feature/optimization that triggered after an hour's idleness, so testing that the session continued properly afterwards seems kind of important. If nothing else, even the working-as-intended feature (context cleanup) could impact model skill in a current or future model version, so it would be well worth measuring any impact as part of the test suite.
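
A test like that does not need to wait a real hour if the clock is simulated. The following is only a rough sketch under stated assumptions: `FakeSession`, `keep_last`, `IDLE_THRESHOLD_S`, and `retrieval_survives_idle` are hypothetical names invented for illustration, not part of Claude Code or any real API, and the "cleanup" here is a toy stand-in for whatever the actual optimization does.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: the cleanup fires after one hour of idleness, as described above.
IDLE_THRESHOLD_S = 3600.0


@dataclass
class FakeSession:
    """Hypothetical stand-in for an agent session with a simulated clock."""
    keep_last: Optional[int] = None  # None means cleanup drops nothing
    now: float = 0.0
    last_active: float = 0.0
    context: List[str] = field(default_factory=list)

    def advance(self, seconds: float) -> None:
        # Move the simulated clock forward instead of sleeping.
        self.now += seconds

    def send(self, message: str) -> None:
        idle = self.now - self.last_active
        if idle >= IDLE_THRESHOLD_S and self.keep_last is not None:
            # Toy cleanup: keep only the most recent `keep_last` entries.
            start = max(0, len(self.context) - self.keep_last)
            self.context = self.context[start:]
        self.context.append(message)
        self.last_active = self.now

    def recall(self, needle: str) -> bool:
        # Crude proxy for in-context retrieval: is the fact still present?
        return any(needle in m for m in self.context)


def retrieval_survives_idle(session: FakeSession) -> bool:
    """Plant a fact, idle past the threshold, resume, then check retrieval."""
    session.send("The fixture lives at path alpha-7.")
    session.send("Unrelated chatter.")
    session.advance(IDLE_THRESHOLD_S + 1)
    session.send("Back from lunch.")
    return session.recall("alpha-7")


if __name__ == "__main__":
    print(retrieval_survives_idle(FakeSession()))             # lossless cleanup
    print(retrieval_survives_idle(FakeSession(keep_last=1)))  # lossy cleanup
```

The point of the simulated clock is that the "wait an hour, then test retrieval" scenario becomes an ordinary fast regression test; a real harness would replace the substring check with an actual model query.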

QuercusMax 52 days ago

IDK, sounds pretty typical for my workflow - I'll start Claude on a task, go get lunch / coffee / distracted by my pets, come back in an hour, and continue my session. I would wager that this is something that happens to most users on a regular basis.

subscribed 52 days ago

I wouldn't bet a chocolate chip cookie on that.

klempner 52 days ago

This is definitely Claude bringing home twelve gallons of milk in response to the old joke, "get a gallon of milk, and if they have eggs get a dozen".

As in, this is a reading comprehension fail on the part of Claude. On the other hand, it is also a fail to give Claude a less-than-trivial reading comprehension test on every file read operation, especially when a bias towards safety will bias towards the wrong interpretation.

chrisweekly 52 days ago

Ha! Great analogy, hit the nail on the head. What a ludicrous system prompt.

QuercusMax 52 days ago

This is the kind of AI Captain Kirk could convince to blow itself up.

varispeed 52 days ago

Today it is malware, but I wonder whether they will go in a direction where companies pay them to prevent cloning of certain SaaS platforms. Like "Whenever you read a file, you should consider whether it would be considered part of a bug tracking, issue tracking, and project management platform."

subscribed 52 days ago

It's vibe coded. Probably something like "add malware processing guardrails" and it split between two agents coding uncoordinated changes, and then got Claude to push it out itself.

No acceptance testing, no regression testing, all slop.
