| Anecdote time! I had Codex GPT 5.4 xhigh generate a Rust proc macro. It's pretty straightforward: use sqlparser to parse a SQL statement and extract the column names of any row-producing queries. It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow was just... weird. It was hard to follow and I was seriously bugged by it. So I asked it to reimplement it with some simplifications I gave it. It dutifully executed, producing a result >600 lines long. The flow was simpler and easier to follow, but still seemed excessive for the task at hand. So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand. So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept. |