by libraryofbabel
310 days ago
Yeah, agree that the benchmarks don't really seem to reflect the community consensus. I wonder if part of it is the better symbiosis between the agent (Claude Code) and the Opus and Sonnet models it uses, which are supposedly fine-tuned on Claude Code tool calls? But agree, there is probably some additional secret sauce in the training, perhaps to do with RL on multi-step problems...
|
My guess for why GPT5 scores higher on benchmarks is that they evaluate on well-defined tasks with all instructions given at the start.
Real life is multi-turn, with multiple sets of prompts to adhere to. This is where Claude is likely better.