|
|
|
|
|
by gertlabs
60 days ago
|
|
A better benchmark needs to be objectively scored, have multi-disciplinary, breadth, and be scalable (no single correct answer). That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding. |
|
Opus otoh is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on gpt 5.5 to check its work.
The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will be after looking at this.
Grats on your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!