by trb
4 days ago
Grok 4.1 Fast won 13 of 30 games at $0.97 per win
The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.
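The 27x figure is the ratio of the two cost-per-win numbers. A quick check of that arithmetic, assuming "per win" means total spend divided by number of wins (the post doesn't say how cost was measured), puts it closer to 27.6x and gives the implied total spend for each model:

```python
# Figures quoted in the post. The "implied total" below ASSUMES
# cost_per_win = total spend / wins; the post does not define it.
results = {
    "Grok 4.1 Fast": {"wins": 13, "cost_per_win": 0.97},
    "Claude Sonnet 4.6": {"wins": 5, "cost_per_win": 26.78},
}


def total_cost(model: str) -> float:
    """Implied total spend for a model: wins times cost per win."""
    r = results[model]
    return r["wins"] * r["cost_per_win"]


# How many times more Claude Sonnet 4.6 paid per win than Grok 4.1 Fast.
ratio = results["Claude Sonnet 4.6"]["cost_per_win"] / results["Grok 4.1 Fast"]["cost_per_win"]

print(f"cost-per-win ratio: {ratio:.1f}x")
for model in results:
    print(f"{model}: ${total_cost(model):.2f} implied total")
```

Under that assumption, Grok's 13 wins cost about $12.61 in total and Claude's 5 wins about $133.90.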
The model with the most kills did not win
GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.
If Grok 4.1 Fast was the top-winning model and Claude Sonnet 4.6 the second, how did GPT 5.4 come in second on the leaderboard? Which one is second, Claude Sonnet 4.6 or GPT 5.4? There were 11 games between “best at killing” and “best at winning”.
What does that mean? How are there 11 games between “best at killing” and “best at winning”?