by pmoxyz
117 days ago
This is great. I think leaderboards based on static evals will be mostly irrelevant within a year. Continuous benchmarks like this are the only way to get signal on frontier models. You mention that Opus 4.6 cost $1200 in one match. How do you plan to benchmark economic efficiency? Looking at the performance vs. cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model.
In the leaderboards part of the page, I'll be auto-populating each model's token cost as a metric to evaluate on.
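For anyone who wants to play with the trade-off from the question, here is a rough sketch of two ways a leaderboard could fold cost in: score per dollar, and a Pareto frontier over (score, cost). This is not how the site computes anything. The `ModelResult` type, the model names, and every number except the $1200 figure quoted above are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str        # hypothetical label, not a real leaderboard entry
    score: float     # benchmark performance, higher is better (made-up scale)
    cost_usd: float  # total token cost over the same set of matches


def score_per_dollar(r: ModelResult) -> float:
    """Simple efficiency metric: performance bought per dollar spent."""
    return r.score / r.cost_usd


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Keep models that no other model beats on both score and cost.

    A model is dominated if another one scores at least as well and costs
    at most as much, and is strictly better on at least one of the two.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.score >= r.score
            and o.cost_usd <= r.cost_usd
            and (o.score > r.score or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)


# Illustrative numbers only: "top" uses the $1200 from the thread,
# "cheap" plays 80% as well at 1% of the cost, "mid" is an invented extra.
results = [
    ModelResult("top", score=1.0, cost_usd=1200.0),
    ModelResult("cheap", score=0.8, cost_usd=12.0),
    ModelResult("mid", score=0.7, cost_usd=300.0),
]

efficiency_gain = score_per_dollar(results[1]) / score_per_dollar(results[0])
frontier_names = [r.name for r in pareto_frontier(results)]

print(f"cheap vs top, score per dollar: {efficiency_gain:.0f}x")
print("Pareto frontier (cheapest first):", frontier_names)
```

Score per dollar gives a single sortable column, but it can reward models that are cheap and bad. The frontier view avoids picking a single exchange rate between quality and cost, so both the 'top' model and the cheap one can show up as reasonable choices.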