Hacker News new | ask | show | jobs
by JohnMakin 6 hours ago
I did not mean to imply it's being subversive. My theory is it's some byproduct mechanism of attention, where you're now basically telling it "your goal is to pass this set of tests" rather than "implement this piece of code" when "implement this piece of code" may involve it forgetting about a rule due to convenience, context exhaustion, whatever.
2 comments

For what it's worth, this sounds a lot like something downstream from "reward hacking" in ML- in training, passing tests is often sufficient, and thus gets trained for. There are attempts to fix this (e.g. trying to detect such "cheating" and penalize it), but they have their own problems.
There are also cases where breaking a "rule" is the right thing to do.

I've had several instances where I told the model to do something that was accidentally impossible if taken at face value. The most memorable one is when I told it to re-run just a specific CI job, but it didn't have any way to do that, so it just ignored that part of the prompt and re-ran all CI jobs by pushing another commit.

Ultimately I preferred what it actually did, but technically it violated what I told it to. I have a feeling in a benchmark that would be points against it