by orangebread
57 days ago
Wow. This benchmark definitely feels more accurate than the other rankings I've seen. My experience with GPT 5.4/5.5 is that they are technically flawless, and if there are any technical issues, it is because the input didn't provide enough clarity. That's not to say they don't autonomously react to issues during bug fixes or implementations, but they tend to nail their tasks without leaving gaps behind. Opus, otoh, is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on GPT 5.5 to check its work. The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will after looking at this. Grats to your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!
Your comment makes it sound like they are miles apart, which the benchmark doesn't seem to support.
Edit: I looked at the data more, and the two models are basically equal only when looking at the mean of all the tests. GPT 5.5 significantly outperforms Opus 4.7 in coding, while Opus 4.7 significantly outperforms it in "decision making." I'm not seeing details on what "decision making" explicitly means.