by highfrequency
310 days ago
Great to see more private benchmarks. I would suggest swapping the evaluator model from o3 to a model from another company, e.g. Gemini 2.5 Pro, to make sure the ranking holds up. For example, if OpenAI models all share some sense of what constitutes good design, it would not be that surprising that o3 prefers GPT-5 code to Gemini code! (I would not even be surprised if GPT-5 were trained partially on output from o3.)