
by johndough 74 days ago

I would have liked aggregated results instead. Expanding 300 tables is a bit tiresome. But I guess that is easy with AI now. Here is a scatter plot of quality vs duration

https://i.imgur.com/wFVSpS5.png

and quality vs cost

https://i.imgur.com/fqM4edw.png

But I just noticed that my plot is meaningless because it conflates model quality with provider uptime.

Claude Haiku has a higher average quality than Claude Opus, which does not make sense. The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.

3 comments

skysniper 74 days ago

> The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.

All network errors, provider errors, and OpenClaw errors are actually excluded from the ranking calculation, so that is not the reason.

The real reason:

Absolute scores are not consistent across tasks and cannot be directly added or averaged, whether the judge is a human or an LLM. But the relative rank is stable (model A is better than B). That is exactly why Chatbot Arena only uses the relative rank of models in each battle in the first place, and why we follow that approach.
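To illustrate the relative-rank idea, here is a minimal sketch of a standard Elo update that uses only who won each battle and ignores the absolute scores. The K-factor of 32, the starting rating of 1000, the model names, and the battle list are all assumptions for illustration; this is not necessarily the exact computation behind the leaderboard or Chatbot Arena.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(battles, k=32.0, start=1000.0):
    """Compute ratings from (winner, loser) pairs; absolute scores are never used."""
    ratings = defaultdict(lambda: start)
    for winner, loser in battles:
        e_w = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - e_w)  # a surprising win moves the ratings more
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical battle outcomes (assumed data, not from the arena).
battles = [
    ("opus", "sonnet"),
    ("sonnet", "haiku"),
    ("opus", "haiku"),
    ("opus", "sonnet"),
]
ratings = elo_ratings(battles)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because each update only needs the outcome of one battle, it does not matter whether the judge scored that battle 10 vs 8 or 6 vs 4.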

A concrete example of why scores across tasks cannot be added or averaged directly: people tend to try Haiku on easier tasks and compare it with T2 models, and try Opus on harder tasks and compare it with better models.

Another example: judges (human or LLM) tend to adjust scores based on the opponents. Sonnet might get 10/10 if all other opponents are Haiku-level, but might get 8/10 if the opponents include Opus or gpt-5.4.
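The task-selection effect can be reproduced with a toy example. The scores below are made up purely for illustration: the smaller model mostly runs easy tasks, the larger one mostly runs hard tasks, and the two only meet on tasks `e1` and `h1`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task scores (assumed data): "e" tasks are easy, "h" tasks are hard.
scores = {
    "e1": {"haiku": 8, "opus": 9},
    "e2": {"haiku": 8},
    "e3": {"haiku": 9},
    "h1": {"haiku": 4, "opus": 6},
    "h2": {"opus": 5},
    "h3": {"opus": 6},
}

# Naive aggregate: average every score a model received, regardless of task.
per_model = defaultdict(list)
for task_scores in scores.values():
    for model, score in task_scores.items():
        per_model[model].append(score)
avg = {model: mean(values) for model, values in per_model.items()}

# Relative comparison: count wins only on tasks where both models ran.
wins = defaultdict(int)
for task_scores in scores.values():
    if len(task_scores) < 2:
        continue
    wins[max(task_scores, key=task_scores.get)] += 1

print(avg)         # haiku has the higher average
print(dict(wins))  # opus wins every shared task
```

In this toy data the average favors the smaller model, while every head-to-head comparison favors the larger one, which is the same pattern as in the scatter plot.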

So if you want to make the plot, you should plot the Elo score (from the leaderboard) vs average cost per task. But note that the average cost has a similar issue: people naturally use smaller models to run simpler tasks, so a smaller model's lower cost comes from two factors: lower unit cost and simpler tasks.

The methodology page contains more details if you are interested.

johndough 74 days ago

I agree. If humans are allowed to pick the models, there will be an inherent bias. This would be much easier if the models were randomized.

esafak 73 days ago

The second chart depicts StepFun > Sonnet > Opus in quality?

skysniper 73 days ago

Check out my reply; his chart is plotting the wrong metric (average quality score).

skysniper 73 days ago

I added a native plot and stats for aggregated results on the arena page. Please check it out!

johndough 73 days ago

Nice! It would be even better if the model name was shown by default instead of having to hover, but I got the information that I wanted. In case you should be concerned about the aesthetics with too many model names, I can recommend the adjustText library in Python, which makes it so that labels do not overlap. Something similar probably exists in JS (or an LLM can just translate the relevant bits).
