by skysniper 74 days ago

I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.

The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 cost-effectiveness, #5 performance.

Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.

Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
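For readers unfamiliar with the model, here is a minimal sketch of fitting a plain Plackett-Luce model from ranked battles and bootstrapping rank intervals. It is not the Arena's implementation: it ignores the "grouped" part of the methodology, uses a standard minorization-maximization (MM) update, and the function names, placeholder model names, and toy battles are all made up for illustration.

```python
import random

def fit_plackett_luce(rankings, n_iter=200):
    """Fit Plackett-Luce strengths with MM updates.

    rankings: list of lists of model names, each ordered best to worst,
    with at least one ranking containing two or more models.
    Returns a dict of strengths normalized to sum to 1.
    """
    models = sorted({m for r in rankings for m in r})
    strength = {m: 1.0 / len(models) for m in models}
    # wins[m]: how often m was placed ahead of at least one remaining model
    wins = {m: 0 for m in models}
    for r in rankings:
        for m in r[:-1]:
            wins[m] += 1
    for _ in range(n_iter):
        denom = {m: 0.0 for m in models}
        for r in rankings:
            for t in range(len(r) - 1):
                remaining = r[t:]  # models still unranked at stage t
                inv = 1.0 / sum(strength[m] for m in remaining)
                for m in remaining:
                    denom[m] += inv
        raw = {m: wins[m] / denom[m] if denom[m] > 0 else 0.0 for m in models}
        total = sum(raw.values())
        strength = {m: v / total for m, v in raw.items()}
    return strength

def bootstrap_rank_intervals(rankings, n_boot=200, alpha=0.05, seed=0):
    """Percentile intervals for each model's rank, resampling whole battles."""
    rng = random.Random(seed)
    ranks = {}
    for _ in range(n_boot):
        sample = [rng.choice(rankings) for _ in rankings]
        s = fit_plackett_luce(sample, n_iter=50)
        # Ranks are computed among the models present in this resample.
        for pos, m in enumerate(sorted(s, key=s.get, reverse=True), start=1):
            ranks.setdefault(m, []).append(pos)
    intervals = {}
    for m, rs in ranks.items():
        rs.sort()
        lo = rs[int((alpha / 2) * (len(rs) - 1))]
        hi = rs[int((1 - alpha / 2) * (len(rs) - 1))]
        intervals[m] = (lo, hi)
    return intervals

# Toy battles with placeholder names (not real Arena data).
battles = [
    ["model_a", "model_b", "model_c"],
    ["model_a", "model_c", "model_b"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_b", "model_c"],
]
strengths = fit_plackett_luce(battles)
leaderboard = sorted(strengths, key=strengths.get, reverse=True)
intervals = bootstrap_rank_intervals(battles)
```

Only the order within each battle enters the fit, which is the sense in which raw scores are discarded.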

I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, a judge agent evaluates in a fresh VM. Public benchmarks are free.

4 comments

vessenes 74 days ago

Cheapest just isn't a very useful metric. Can I suggest a Pareto-curve type representation? Cost / request vs Elo would be useful, and you have all the data.
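As a rough sketch of what a Pareto view could compute: a model is on the front if no other model is at least as cheap and at least as highly rated, and strictly better on one of the two. The `pareto_front` helper, the placeholder model names, and the numbers below are hypothetical, not taken from the leaderboard.

```python
def pareto_front(models):
    """models: dict name -> (cost_per_request, rating).

    Returns the names that no other model dominates, sorted by cost.
    """
    front = []
    for name, (cost, rating) in models.items():
        dominated = any(
            c <= cost and r >= rating and (c < cost or r > rating)
            for other, (c, r) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: models[n][0])

# Made-up numbers for illustration only.
example = {
    "model_a": (0.02, 1050),
    "model_b": (0.10, 1100),
    "model_c": (0.15, 1080),  # costs more than model_b and rates lower
    "model_d": (0.90, 1200),
}
front = pareto_front(example)
```

Dominance only compares ratings with each other, so the front would stay the same under any monotonically increasing rescaling of the rating axis.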

skysniper 74 days ago

TBH that was my initial thought too, but I found a problem with this approach:

Essentially I'm using the relative rank in each battle to fit a latent strength for each model, and then using a nonlinear function to map the latent strength to Elo just for human readability. The mapping function is arbitrary as long as it is monotonically increasing, so that it preserves the rank. The only reliable result (one that is invariant to the choice of function) is the relative rank of the models.

That said, if I use score/cost as the metric, the ranking depends entirely on the function I choose: a more super-linear function makes high-performance models rank higher on the score/cost board, while a more sub-linear function makes low-performance models rank higher.

That's why I eventually tried another (the current) approach: let the judge rank the models directly on cost-effectiveness (considering both performance and cost), and compute the cost-effectiveness leaderboard from those ranks, so the score mapping function does not affect the leaderboard at all.
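A small illustration of the dependence described above, using made-up latent strengths and costs for three placeholder models. Both mapping functions are monotonically increasing, so the performance order is the same under each, yet the score/cost order flips. The function names and numbers are assumptions for the sketch.

```python
import math

# Made-up values: model_a is the strongest and the most expensive.
latent = {"model_a": 2.0, "model_b": 1.0, "model_c": 0.2}
cost = {"model_a": 1.00, "model_b": 0.30, "model_c": 0.05}  # USD per task

def sub_linear(s):
    return math.sqrt(s)

def super_linear(s):
    return math.exp(3 * s)

def rank_by_score_per_cost(score_map):
    """Order models by mapped score divided by cost, best first."""
    ratio = {m: score_map(latent[m]) / cost[m] for m in latent}
    return sorted(ratio, key=ratio.get, reverse=True)

def rank_by_score(score_map):
    """Order models by mapped score alone, best first."""
    return sorted(latent, key=lambda m: score_map(latent[m]), reverse=True)

sub_order = rank_by_score_per_cost(sub_linear)      # cheapest model first
super_order = rank_by_score_per_cost(super_linear)  # strongest model first
```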

refulgentis 74 days ago

Please don’t use AI to write comments, it cuts against HN guidelines.

skysniper 74 days ago

sorry, didn't know that. Here is my hand-written tldr:

gemini is very unreliable at using skills; it often just reads the skills and decides to do nothing.

stepfun leads the cost-effectiveness leaderboard.

ranking really depends on the task, so better to try your own.

refulgentis 74 days ago

It’s too late once it’s happened. I was curious, then when I saw the site looked vibecoded and you’re commenting with AI, I decided to stop trying to reason through the discrepancies between what was claimed and what’s on the site (e.g. 300 battles vs. only a handful in site data).

rat9988 74 days ago

Too late for what? For you? maybe. There are many others that are okay with it and it doesn't diminish the quality of the work. Props to the author.

refulgentis 74 days ago

> Too late for what? For you? maybe.

Maybe? :)

> There are many others that are okay with it

Correct.

> and it doesn't diminish the quality of the work.

It does affect incoming people hearing about the work.

I applaud your instinct to defend someone who put in effort. It's one of the most important things we can do.

Another important thing we can do for them is be honest about our own reactions. It's not sunshine and rainbows on its face, but it is generous. Mostly because A) it takes time and B) other people might see red and harangue you for it.

skysniper 74 days ago

data for all 300+ battles is available at https://app.uniclaw.ai/arena/battles; every single battle is shown with the raw conversation history, produced files, the judge's verdict, and final scores

refulgentis 74 days ago

Thanks! Is the judge an LLM? There are a lot of references to "just like LMArena", but LMArena is human evaluated?

skysniper 74 days ago

> Is the judge an LLM?

Yes, the judge is one of opus 4.6, gpt 5.4, or gemini 3.1 pro (the submitter can choose). Self-judging (where the judge model is also one of the participants) is excluded when computing rankings.

> There are a lot of references to "just like LMArena", but LMArena is human evaluated?

Yeah, LMArena is human-evaluated, but here i found it impractical to gather enough human evaluation data because the effort it takes to compare results is much higher:

- for code, the judge needs to read through it to check code quality, and actually run it to see the output

- when producing a webpage or a document, the judge needs to check the content and layout visually

- when anything goes wrong, the judge needs to read the execution log to see whether partial credit should be granted

if you look at the cost details of each battle (available at the bottom of the battle detail page), the judge typically costs more than any participant model.

if we evaluated with humans, i would say each evaluation could easily take ~5-10 min

citizenpaul 74 days ago

>Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance

This has also been my subjective experience, and it has been objectively true in terms of cost.

johndough 74 days ago

Could you add a column for time or number of tokens? Some models take forever because of their excessive reasoning chains.

skysniper 74 days ago

both are shown on the battle detail page already. Time is shown in the Scores table. The number of tokens is shown in Cost details at the bottom of the Scores. (I thought most people just want to see cost in USD, so I put token details at the bottom)

johndough 74 days ago

I would have liked aggregated results instead. Expanding 300 tables is a bit tiresome. But I guess that is easy with AI now. Here is a scatter plot of quality vs duration

https://i.imgur.com/wFVSpS5.png

and quality vs cost

https://i.imgur.com/fqM4edw.png

But I just noticed that my plot is meaningless because it conflates model quality with provider uptime.

Claude Haiku has a higher average quality than Claude Opus, which does not make sense. The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.

skysniper 74 days ago

> The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.

all network errors, provider errors, and openclaw errors are actually excluded from the ranking calculation, so that is not the reason.

Real reason:

The absolute score is not consistent across tasks and cannot be directly added/averaged, for both human and LLM judges. But the relative rank is stable (model A is better than B). That is exactly why Chatbot Arena only uses the relative rank of models in each battle in the first place, and why we follow that approach.

a concrete example of why scores across tasks cannot be added/averaged directly: people tend to try haiku on easier tasks and compare it with T2 models, and try opus on harder tasks and compare it with better models.

another example: judges (human or llm) tend to shift scores based on the opponents, e.g. Sonnet might get 10/10 if all other opponents are Haiku level, but 8/10 if the opponents include Opus/gpt-5.4.
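The task-selection effect can be reproduced with a toy dataset. In the sketch below (placeholder model names, made-up judge scores, hypothetical helper functions), `big` is mostly tried on hard tasks and `small` on easy ones, so `small` gets the higher average score even though `big` wins the one battle they share.

```python
from statistics import mean

# Hypothetical battles: each dict maps a model name to the judge's 0-10 score.
battles = [
    {"big": 9, "small": 8},   # easy task, both models compared directly
    {"small": 9, "tiny": 7},  # easy tasks, small vs a weaker model
    {"small": 9, "tiny": 8},
    {"big": 5, "peer": 4},    # hard tasks, big vs a strong peer
    {"big": 6, "peer": 5},
]

def average_score(model):
    """Mean absolute score over every battle the model took part in."""
    return mean(b[model] for b in battles if model in b)

def head_to_head(a, b):
    """(wins for a, wins for b) over battles that contain both models."""
    shared = [x for x in battles if a in x and b in x]
    return (sum(x[a] > x[b] for x in shared), sum(x[b] > x[a] for x in shared))

avg_big = average_score("big")
avg_small = average_score("small")
direct = head_to_head("big", "small")
```

A rank-based fit only ever sees comparisons made on the same task, which is why it is not distorted in this way.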

So if you want to make the plot, you should plot the elo score (from the leaderboard) vs average cost per task. But note that the average cost has a similar issue: people naturally use smaller models for simpler tasks, so a smaller model's lower cost comes from two factors: lower unit cost and simpler tasks.

the methodology page contains more details if you are interested.

johndough 74 days ago

I agree. If humans are allowed to pick the models, there will be an inherent bias. This would be much easier if the models were randomized.

esafak 74 days ago

The second chart depicts StepFun > Sonnet > Opus in quality?

skysniper 74 days ago

check out my reply; that chart is plotting the wrong metric (average quality score)

skysniper 74 days ago

i added a native plot and stats for aggregated results on the arena page. please check it out!

johndough 73 days ago

Nice! It would be even better if the model name was shown by default instead of having to hover, but I got the information that I wanted. If you are concerned about the aesthetics of too many model names, I can recommend the adjustText library in Python, which keeps labels from overlapping. Something similar probably exists in JS (or an LLM can just translate the relevant bits).

hadlock 74 days ago

some kind of top-level metric like avg tokens/task would be useful. e.g. yes, stepfun is 5% the price of sonnet, but does it use 1x, 10x or 1000x more tokens to accomplish similar tasks (median per task)? for example, I am willing to eat a 20% quality dive from sonnet if the token use is < 10% more than sonnet. if token use is 1000x then that's something I want to know.
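The arithmetic behind this question, as a back-of-the-envelope sketch: it assumes one blended per-token price per model, so cost per task is price times tokens. The 5% price ratio is the figure from the comment above; the function names and token multiples are illustrative.

```python
def relative_cost(price_ratio, token_ratio):
    """Cost per task relative to the reference model.

    price_ratio: cheap model's per-token price / reference model's price
    token_ratio: cheap model's tokens per task / reference model's tokens
    """
    return price_ratio * token_ratio

def break_even_token_ratio(price_ratio):
    """Token multiple at which the cheaper model stops being cheaper per task."""
    return 1.0 / price_ratio

price_ratio = 0.05  # "5% the price", taken from the comment
costs = {mult: relative_cost(price_ratio, mult) for mult in (1, 10, 1000)}
limit = break_even_token_ratio(price_ratio)
```

Under that assumption, a model at 5% of the price is still cheaper per task at 10x the tokens (about half the cost) but far more expensive at 1000x (about 50 times the cost), with break-even around 20x.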

skysniper 74 days ago

added https://app.uniclaw.ai/arena/model-stats

also added per-battle stats on the battle detail page