by georgewsinger
499 days ago
Did anyone else notice that o3-mini's SWE-bench score dropped from 61% in the leaked System Card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks? Am I missing something?

> We evaluate SWE-bench in two settings:
> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.
> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
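
For anyone trying to compare the two numbers, here is a rough sketch of how I read the two pass@1 definitions quoted above. This is only my interpretation of the quoted text, not OpenAI's evaluation code; the function names and the `True`/`False`/`None` encoding of attempts are made up for illustration.

```python
from typing import Optional, Sequence

# Hypothetical encoding of one attempt:
#   True  = patch was valid and passed the tests
#   False = patch was valid but failed the tests
#   None  = no valid (non-empty) patch was produced
Attempt = Optional[bool]


def agentless_pass_at_1(instances: Sequence[Sequence[Attempt]]) -> float:
    """Average per-instance pass rate, counting only valid patches.

    An instance with no valid patch on any attempt scores 0.
    """
    if not instances:
        raise ValueError("need at least one instance")
    rates = []
    for attempts in instances:
        valid = [a for a in attempts if a is not None]
        rates.append(sum(valid) / len(valid) if valid else 0.0)
    return sum(rates) / len(rates)


def tools_pass_at_1(instances: Sequence[Sequence[bool]]) -> float:
    """Plain average over all tries per instance (4 in the quoted setup)."""
    if not instances:
        raise ValueError("need at least one instance")
    rates = [sum(attempts) / len(attempts) for attempts in instances]
    return sum(rates) / len(rates)


# Toy data, three instances with 5 tries each (Agentless-style).
agentless_runs = [
    [True, None, None, None, None],     # 1 valid patch, it passed  -> 1.0
    [False, False, True, True, None],   # 4 valid patches, 2 passed -> 0.5
    [None, None, None, None, None],     # no valid patch            -> 0.0
]
agentless_score = agentless_pass_at_1(agentless_runs)  # (1.0 + 0.5 + 0.0) / 3 = 0.5

# Toy data, two instances with 4 tries each (tools-style).
tools_runs = [
    [True, False, True, True],          # 0.75
    [False, False, False, False],       # 0.0
]
tools_score = tools_pass_at_1(tools_runs)  # (0.75 + 0.0) / 2 = 0.375
```

Note that in the first toy instance a single passing patch out of five tries scores 1.0 under the Agentless rule, because the four empty patches are dropped rather than counted as failures. If that reading is right, the two settings are not measured the same way, which is one more reason to be careful about comparing 61% and 49.3% directly.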