Hacker News
by joshka 25 days ago
If you look at the 95% CI on https://marginlab.ai/trackers/codex/ with N=50, it's still pretty huge (usually +/- 13-14%). I suspect it would be difficult to get a reliable numerical measure of whether an AGENTS.md is good. What you can observe, though, is whether the model paid attention to certain rules while editing, i.e. did the behavior you're steering toward or away from actually take place?
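For a rough sense of where a figure like that comes from, here is a minimal sketch of the normal-approximation (Wald) interval for a pass rate. This is an assumption about the arithmetic, not a description of how the tracker actually computes its interval; `ci_half_width` and the example pass rates are made up for illustration.

```python
import math
from statistics import NormalDist


def ci_half_width(p: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation CI for a proportion.

    p is the observed pass rate (0..1), n is the number of tasks.
    Hypothetical helper, not taken from any tracker's code.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Illustrative pass rates only; not measured values.
    for p in (0.5, 0.6, 0.7):
        print(f"p={p:.1f}, n=50: +/- {ci_half_width(p, 50):.1%}")
```

With n=50 the half-width stays in the low-to-mid teens of percentage points for mid-range pass rates, and it shrinks only with the square root of n, so quadrupling the task count roughly halves it.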

The hardest thing, I think, is judging whether your AGENTS.md is still good after each model release. OpenAI does release prompting guidance to help with this, though (and has added a skill to apply it to your prompts, IIRC).

1 comment

Yes, agreed that low n makes overclaiming a real risk with this sort of optimization loop. Low-n results can be useful directionally, but you can't claim superiority without expanding the dataset. If I were running this for a shared repo where improving AGENTS.md had real consequences / value, rather than just as an experiment, I would expand n severalfold for training / holdout, depending on the expected variation across tasks.
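To put a rough number on "expanding n", here is a sketch of the textbook sample-size approximation for comparing two pass rates (two-sided z-test on two proportions). Everything here is an assumption for illustration: `n_per_arm` is a hypothetical helper, the 50% and 60% pass rates are invented, and the formula treats tasks as independent, which may not hold if task difficulty varies a lot.

```python
import math
from statistics import NormalDist


def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate tasks needed per variant to detect p1 vs p2.

    Uses the normal approximation for a two-sided two-proportion test.
    Hypothetical helper; p1 and p2 are assumed true pass rates.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Invented example: baseline AGENTS.md at 50%, candidate at 60%.
    print(n_per_arm(0.50, 0.60))
```

Under these assumptions, telling a 10-point improvement apart from noise takes a few hundred tasks per variant, which is why a 50-task result is better read as directional.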

I'm also noticing similar patterns with needing to update AGENTS.md / skills per model release. E.g. with Opus 4.6 -> 4.7, the model became much more instruction-adherent, so instructions written for the prior model generation might cause unexpected behavior in the new one. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude: the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.