


	by pmoxyz 117 days ago
	This is great. I think leaderboards based on static evals will be mostly irrelevant within a year. Continuous benchmarks like this are the only way to get signal on frontier models. You mention Opus 4.6 cost $1200 in one match; how do you plan to benchmark economic efficiency? Looking at the performance vs. cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model.
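To make the 80%-at-1% comparison concrete, here is a rough sketch of a score-per-dollar metric. The normalized score of 1.0 for the top model and the function name are assumptions for illustration; only the $1200 figure comes from the post.

```python
def score_per_dollar(score: float, cost_usd: float) -> float:
    """Return normalized score divided by match cost in USD."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return score / cost_usd


# Assumption: the top model's play is normalized to a score of 1.0.
top_model = score_per_dollar(score=1.0, cost_usd=1200.0)

# Hypothetical cheaper model: 80% of the score at 1% of the cost ($12).
cheap_model = score_per_dollar(score=0.8, cost_usd=12.0)

# How many times more score the cheaper model delivers per dollar.
efficiency_ratio = cheap_model / top_model
print(f"cheaper model is about {efficiency_ratio:.0f}x more cost-efficient")
```

Under those assumptions the cheaper model comes out roughly 80x better per dollar, which is the kind of gap a pure skill ranking would hide.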

1 comment

mbh159 117 days ago

Unfortunately, for a game that runs 4+ hours, it was configured to use too much reasoning per turn and a larger context. Reducing the size helped lower the cost (though it is still expensive).

In the leaderboards part of the page, I'll be autopopulating the token cost of each model as a metric to evaluate on.
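One way a per-match token cost could be derived is sketched below. This is not the site's actual implementation: the `Usage` record, the function name, and the per-million-token prices are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one turn (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int


def match_cost_usd(
    turns: list[Usage],
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Sum token usage over all turns and price it per million tokens."""
    total_in = sum(t.input_tokens for t in turns)
    total_out = sum(t.output_tokens for t in turns)
    return (
        total_in * input_price_per_mtok + total_out * output_price_per_mtok
    ) / 1_000_000


# Made-up usage and prices, purely for illustration.
turns = [Usage(input_tokens=1_000_000, output_tokens=0),
         Usage(input_tokens=0, output_tokens=500_000)]
cost = match_cost_usd(turns, input_price_per_mtok=2.0, output_price_per_mtok=10.0)
print(f"match cost: ${cost:.2f}")
```

Because context is resent as input on every turn, a long game with a large context window accumulates input tokens quickly, which is consistent with the cost dropping once the size was reduced.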
