by joshka
25 days ago
If you look at the 95% CI on https://marginlab.ai/trackers/codex/ with N=50, it's still pretty huge (usually +/- 13-14%). I suspect it would be difficult to get a measure that numerically assesses whether an AGENTS.md is good. What you can observe, though, is whether the model paid attention to certain rules while editing, i.e. whether the behavior you're steering away from or towards actually took place. The hardest thing, I think, is judging whether your AGENTS.md is still good after each model release. OpenAI does release prompting guidance to help with this, however (and has added a skill to apply it to your prompts, IIRC).
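For a sense of where an interval that wide comes from, here is a rough back-of-the-envelope sketch. It assumes each of the N runs is an independent pass/fail trial and uses the plain normal-approximation (Wald) interval; I don't know which method the tracker actually uses, and the function names are just made up for illustration.

```python
import math


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for a pass rate p over n runs.

    Assumption: runs are independent pass/fail trials (binomial model).
    z = 1.96 corresponds to a 95% interval.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def runs_needed(p: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n for which the Wald half-width is at most half_width."""
    return math.ceil(z ** 2 * p * (1.0 - p) / half_width ** 2)


if __name__ == "__main__":
    # Pass rates between 50% and 70% are illustrative, not taken from the tracker.
    for p in (0.5, 0.6, 0.7):
        print(f"p={p:.1f}, N=50: +/- {100 * wald_half_width(p, 50):.1f} points")
    print("runs for +/- 5 points at p=0.5:", runs_needed(0.5, 0.05))
```

Under those assumptions, a pass rate somewhere around 50-60% with N=50 lands in the 13-14 point range, and shrinking the interval to +/- 5 points would take several hundred runs per configuration, which is part of why a numeric AGENTS.md score is hard to get cheaply.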
|
I'm also noticing a similar pattern of needing to update AGENTS.md / skills with each model release. E.g., going from Opus 4.6 -> 4.7, the model became much more instruction-adherent, so instructions written for the prior model generation might cause unexpected behavior in the new one. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude: the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.