by khurdula 49 days ago

Yeah, we selected the models that are most commonly integrated into developer workflows and used for structured output. Those models tend to be in the low-to-mid cost range, with no or low reasoning.

For the benchmark, this was kept consistent across all models, and typically Opus and 3.1 Pro would be overkill and expensive even with reasoning off.

Good point tho, will add this point in the blog too :)

Also, the benchmark is open source, so anyone can run a model on it and open a PR too. The leaderboard is dynamic and will automatically add that in.

2 comments

staticshock 49 days ago

The value of such a benchmark, to me, would be, "what is peak performance", not just "what is mid-tier performance". Also, possibly, "what's the per-dollar performance". Time and money permitting, I'd really want to see your benchmark extended to the large reasoning models.


stared 49 days ago

Then the way to go is to use a Pareto frontier, e.g. https://quesma.com/benchmarks/binaryaudit/#cost
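For anyone unfamiliar with the idea: a model is on the cost/score Pareto frontier if no other model is both at least as cheap and at least as good. A minimal sketch in Python, assuming each run boils down to one total cost and one score; the `Result` type, the `pareto_frontier` helper, and the model names and numbers are all made up for illustration and are not taken from the benchmark:

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost_usd: float  # total cost of the benchmark run (lower is better)
    score: float     # benchmark score (higher is better)


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no cheaper-or-equal result matches or beats on score."""
    frontier: list[Result] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, best score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score)):
        # Keep a result only if it beats every cheaper-or-equal result seen so far.
        # Exact duplicates (same cost and score) are collapsed to one entry.
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


# Hypothetical placeholder data, not real benchmark numbers.
RESULTS = [
    Result("model-a", 1.0, 0.70),
    Result("model-b", 2.0, 0.65),   # dominated by model-a (pricier and worse)
    Result("model-c", 3.0, 0.85),
    Result("model-d", 10.0, 0.90),
    Result("model-e", 12.0, 0.88),  # dominated by model-d
]

for r in pareto_frontier(RESULTS):
    print(f"{r.name}: ${r.cost_usd:.2f}, score {r.score:.2f}")
```

With a view like this, cheap models and large reasoning models can sit on the same chart, and readers can pick the point that fits their budget.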

If you want to avoid using Opus 4.7, then why include GPT-5.4 (unless it comes with a disclaimer that it is run at a low reasoning setting, or you check that on medium its price is comparable with Haiku/Flash)?

Also, usually it is good to look at the newest model. Gemini 2.5 Flash is quite dated. Gemini 3.1 Flash Lite is the new one (https://openrouter.ai/google/gemini-3.1-flash-lite-preview).
