Hacker News
by kristopolous 4 hours ago
I have a script that ranks these based on the Coding Index from Artificial Analysis.

All it does is pull a JSON from their main table page and parse out the fields I care about (coding).

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size name
  47.1   58  large Kimi K2.6
  47.5   54  large DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5   70    -   Muse Spark
  47.6   132   -   Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205   -   Claude Opus 4.5 (Reasoning)
  48.1   132   -   Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6   55    -   GPT-5.5 (Non-reasoning)
  48.7   188   -   GPT-5.2 (xhigh)
  50.1   29    -   Qwen3.7 Max
  50.7   1   large GLM-5.2 (max)
  50.9   120   -   Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5   92    -   GPT-5.4 mini (xhigh)
  52.1   55    -   GPT-5.5 (low)
  52.5   62    -   Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132   -   GPT-5.3 Codex (xhigh)
  53.1   62    -   Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118   -   Gemini 3.1 Pro Preview
  56.2   55    -   GPT-5.5 (medium)
  56.7   20    -   Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104   -   GPT-5.4 (xhigh)
  58.5   55    -   GPT-5.5 (high)
  59.1   55    -   GPT-5.5 (xhigh)
  62     8     -   Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash
The repo: https://github.com/day50-dev/aa-eval-email

Some key takeaways:

* Open models are on about a 4-7 month lag right now, depending on how you want to measure it.

* If this keeps up, you might see an open-weights model doing Claude Fable 5 level work before the new year.
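
One rough way to put a number on that lag, as a sketch and not how the script itself measures anything: for each open-weights row, find the oldest closed row that already scores at least as high, and subtract the ages. This assumes that any `size` value other than `-` marks an open-weights model, which is an interpretation of the output and not something stated here. `lag_days` is a hypothetical helper name.

```shell
#!/bin/sh
# lag_days: hypothetical helper, not part of art-analysis.sh.
# Reads "score age size name" rows on stdin and prints, for each open-weights
# row, how many days earlier a closed model already scored at least as high.
lag_days() {
  awk '
    $1 !~ /^[0-9]/ { next }                      # skip the header and blank lines
    $3 == "-" {                                  # assumption: "-" means closed weights
      n++; cscore[n] = $1 + 0; cage[n] = $2 + 0
      next
    }
    {                                            # everything else counts as open weights
      m++; oscore[m] = $1 + 0; oage[m] = $2 + 0
      name = $4
      for (i = 5; i <= NF; i++) name = name " " $i
      oname[m] = name
    }
    END {
      for (j = 1; j <= m; j++) {
        oldest = -1
        for (i = 1; i <= n; i++)
          if (cscore[i] >= oscore[j] && cage[i] > oldest) oldest = cage[i]
        if (oldest >= 0) printf "%d\t%s\n", oldest - oage[j], oname[j]
      }
    }'
}

# A few rows copied from the partial output above.
lag_days <<'EOF'
47.1   58  large Kimi K2.6
47.5   54  large DeepSeek V4 Pro (Reasoning, Max Effort)
47.8   205   -   Claude Opus 4.5 (Reasoning)
50.7   1   large GLM-5.2 (max)
50.9   120   -   Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
53.1   132   -   GPT-5.3 Codex (xhigh)
EOF
```

On those six rows it prints 147, 151 and 131 days for Kimi K2.6, DeepSeek V4 Pro and GLM-5.2, so roughly four to five months by this particular yardstick. A different yardstick (for example, comparing only against the top closed model) would give a different number.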

If people sign up for the free mailing list (that just does this), I'll go and put it back on. It emails when new model evals drop; it was pretty useful.


  score  age  size   name
  62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  59.1   55   -      GPT-5.5 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  57.2   104  -      GPT-5.4 (xhigh)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  56.2   55   -      GPT-5.5 (medium)
  55.5   118  -      Gemini 3.1 Pro Preview
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  52.1   55   -      GPT-5.5 (low)
  51.5   92   -      GPT-5.4 mini (xhigh)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  50.7   1    large  GLM-5.2 (max)
  50.1   29   -      Qwen3.7 Max
  48.7   188  -      GPT-5.2 (xhigh)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)
Lol thank you for sorting.

Are the scores here normalized such that each point difference is equidistant?

  rank  score  age  size   name
  1     62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  2     59.1   55   -      GPT-5.5 (xhigh)
  3     58.5   55   -      GPT-5.5 (high)
  4     57.2   104  -      GPT-5.4 (xhigh)
  5     56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  6     55.5   118  -      Gemini 3.1 Pro Preview
  7     53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  8     53.1   132  -      GPT-5.3 Codex (xhigh)
  9     52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  10    51.5   92   -      GPT-5.4 mini (xhigh)
  11    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  12    50.7   1    large  GLM-5.2 (max)
  13    50.1   29   -      Qwen3.7 Max
  14    48.7   188  -      GPT-5.2 (xhigh)
  15    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  16    47.8   205  -      Claude Opus 4.5 (Reasoning)
  17    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  18    47.5   70   -      Muse Spark
  19    47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  20    47.1   58   large  Kimi K2.6
  21    47.1   29   -      Gemini 3.5 Flash (minimal)
  22    46.7   449  -      Gemini 2.5 Pro Preview (Mar' 25)
  23    46.5   211  -      Gemini 3 Pro Preview (high)
  24    46.5   16   -      Qwen3.7 Plus
  25    46.4   120  -      Claude Sonnet 4.6 (Non-reasoning, High Effort)
  26    45.6   5    large  Kimi K2.7 Code
  27    45.6   104  -      GPT-5.4 (low)
  28    45.5   56   large  MiMo-V2.5-Pro
  29    45.1   43   -      GPT-5.5 Instant (May 2026)
  30    45.0   29   -      Gemini 3.5 Flash (high)
  31    44.9   58   -      Qwen3.6 Max Preview
  32    44.7   216  -      GPT-5.1 (high)
  33    44.2   188  -      GPT-5.2 (medium)
  34    44.2   126  large  GLM-5 (Reasoning)
  35    43.9   92   -      GPT-5.4 nano (xhigh)
  36    43.4   71   large  GLM-5.1 (Reasoning)
  37    43.4   16   large  MiniMax-M3
  38    43.2   54   large  DeepSeek V4 Pro (Reasoning, High Effort)
  39    43.0   188  -      GPT-5.2 Codex (xhigh)
  40    42.9   76   -      Qwen3.6 Plus
  41    42.9   205  -      Claude Opus 4.5 (Non-reasoning)
  42    42.6   182  -      Gemini 3 Flash Preview (Reasoning)
  43    42.2   99   -      Grok 4.20 0309 (Reasoning)
  44    42.1   56   large  MiMo-V2.5
  45    41.9   91   large  MiniMax-M2.7
  46    41.4   91   -      MiMo-V2-Pro
  47    41.3   121  large  Qwen3.5 397B A17B (Reasoning)
  48    41.0   48   -      Grok 4.3 (high)
  49    40.5   71   -      Grok 4.20 0309 v2 (Reasoning)
  50    40.5   342  -      Grok 4
  51    39.8   54   large  DeepSeek V4 Flash (Reasoning, High Effort)

A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
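
For anyone who wants to reproduce that kind of filtering, here is a sketch in shell. It is not the set of commands actually used for the list above: `top_two_per_model` is a hypothetical helper, and it assumes that variants of one model differ only in the trailing parenthetical, so `GPT-5.5 (xhigh)` and `GPT-5.5 (high)` share the base name `GPT-5.5`.

```shell
#!/bin/sh
# top_two_per_model CUTOFF: hypothetical helper.
# Reads "score age size name" rows on stdin, sorts them best first, keeps at
# most two rows per base model name, and stops after the first row whose name
# starts with CUTOFF.
top_two_per_model() {
  LC_ALL=C sort -k1,1nr | awk -v cutoff="$1" '
    $1 !~ /^[0-9]/ { next }                 # skip the header and blank lines
    {
      name = $4
      for (i = 5; i <= NF; i++) name = name " " $i
      base = name
      sub(/ *\(.*$/, "", base)              # "GPT-5.5 (xhigh)" -> "GPT-5.5"
      if (++seen[base] <= 2) print          # keep the two best rows per base name
      if (cutoff != "" && index(name, cutoff) == 1) exit
    }'
}

# Rows copied from the thread; the medium and low GPT-5.5 rows get dropped.
top_two_per_model 'DeepSeek V4 Flash' <<'EOF'
52.1   55    -   GPT-5.5 (low)
56.2   55    -   GPT-5.5 (medium)
58.5   55    -   GPT-5.5 (high)
59.1   55    -   GPT-5.5 (xhigh)
39.8   54  large DeepSeek V4 Flash (Reasoning, High Effort)
EOF
```

With the real script this would be run as `./art-analysis.sh | top_two_per_model 'DeepSeek V4 Flash'`. Models whose variants are named differently (a suffix outside the parentheses, say) would be treated as separate models by this grouping.
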
My observations:

Surprised to see MiniMax M3 so low on that list, not really my experience, I found it smarter than Gemini for a lot of things, that's for sure.

Also surprised to see Gemini 3.1 ranked that high there. It remains IMHO blatantly incompetent for tool use even in their own harnesses, so I can only assume this benchmark isn't ranking workflow things very high. Gemini can write code just fine. It just can't work well as an agent.

GLM 5.2 and Qwen3.7 Max were, in my experience, fairly expensive on a per-token basis and hard to argue in favour of when the SOTA coding plans have a fixed price that makes them potentially more cost effective. (Yes, I know z.ai has a coding plan, but I've heard reliability nightmare stories, and it's not very cheap.)

DeepSeek is clearly the best value for $$, with the right harness and prompting.

you left some models out like DeepSeek and Kimi, for example.
It was a truncated output from the script to demonstrate what it does ...

If you really want to see all of them:

https://day50.dev/output.txt

Or run the script

Because it's not in the top 20 in their benchmark, it's at #23
Short comments...

- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...

- China is going to eat the US lunch on AI

- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying cyberpunk machines while European researchers were getting together to request EU funds for an investigation into how to breed faster horses.

- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?

None of these models come from universities, European or otherwise.

Mistral is clearly not competing for a frontier model at the moment. Whether this is due to a lack of VC funds or a lack of technical ability, or the former arising from the latter, would be interesting to know.

The top models are from startups. Among the FAANG, only Google managed to get a frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there, though.

So why did no EU-based startups succeed while two US startups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese open weights mixed in. The EU consistently cannot turn its considerable economic output into fast-moving tech firms.

Mistral have moved to actually trying to make money, and have been relatively successful, or at least would count as successful if we lived in a normal world.

They've got a heap of contractors working to help industry adopt LLMs. It is just classic consulting work, and they'd look like a really great company if we weren't comparing them to literal $2T+ companies losing money hand-over-fist...

Apertus was built by universities in Switzerland. Although not frontier, it is fully open [1].

[1] https://apertvs.ai/pages/about/

I'm actually more curious about IBM. Their Granite series appears to be nowhere close to competitive.

They had Watson, remember? It won on Jeopardy like 15 years ago. They've been at this for a long time.

Maybe it's good at something else?

IBM doesn't do technology, they do contracts. Any "technology" is marketing stunts. They hire a bunch of "fellows" (outside contractors) to make a thing they can be first at or whatever, do the stunt, then get a bunch of 5-10 year contracts with customers off the stunt. They then fuck it up for that length of time but still get paid due to those contracts. After that space of time the folks they've burned have moved on; rinse, repeat. Pretty easy to look back at the timeline of "firsts" they have and see the pattern.
Don’t forget the marketing for the new $1B “initiative” (fill in: mobile, cloud, blockchain, AI,…)

Upon closer inspection the $1B is (a) over 10 years, (b) mostly internal cross-billing between departments.

Agree that IBM has no excuse, especially given how long they have been trying to do AI, although Watson was a completely different technology.

They had to start from scratch, but their management doesn't seem smart enough to stop doing it in house. They could have just acquired a startup that could build a frontier model.

Which is also very ironic, since their whole business for the last 15 years has been buying companies a la CA Associates...

Their previous Watson branding and the collapse of Watson expectations cost them one CEO, but the current CEO was part of the same team. They just don't learn...

I view Watson in the same light as Deep Blue, one-offs that brought more prestige and potential share value to IBM than necessarily "moving the needle" in the respective technology.
Granite is OK for speech to text (ASR)
To be honest, living in Switzerland and speaking with peers, we're just exhausted by the constant AI hype. For a lot of us, the fact that Europe isn't frantically trying to scrape the entire internet and every book in existence for the next massive model isn't a bad thing. The big players are doing their thing, like with the nuclear arms race. We regulate a lot, too much a lot of the time, but sometimes that trickles down to other places too. A lot was done right, imo.

ETH Zurich and EPFL recently put out an open model called Apertus (it was on the HN front page a few months back). It's not a frontier model, but they built it properly regarding copyright and data transparency.

It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.

Sir, I would suggest that if Europe fails to be economically competitive, the downstream implications on European society will produce much worse outcomes than (for instance) data transparency…

Doing things with ethical intentions does not necessarily produce outcomes that are beneficial for society at large.

I'm inclined to agree with you, but you could make the same argument for exploiting natural resources and the environment. I don't think it's being done right at the moment, and it does not seem to be benefiting people as much as certain companies.
give me a break.

Europoor is not doing anything. If your lack of AI progress is caused by regulations and respect for IP laws, how about EVs, robotics, drones, batteries, quantum computing. Also slowed down by your over regulations? LOL.

Europoor is called Europoor for a reason, your attitude here is the best explanation on how it happened.

You seem to be confusing Hacker News with 4chan.
They did muse spark ... it's not garbage.

Also what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits facebook's needs perfectly...

Mo Bitar said something like "Meta's LLM is the one you use if you accidentally hit the wrong button in WhatsApp. Its user base is fat-finger phone users."
Understood - they're just doing other things. Maybe custom ad rewriting for a target audience, or some kind of deep analytics insight into user behavior, or translations that optimize for maximizing purchasing habits over literary accuracy ... I'm just saying their incentives are elsewhere and maybe Muse is serving them well.

I mean that is the smart move here. Focus the model on optimizing the core business. For Meta, that's not coding tools.

> China is going to eat the US lunch on AI

They will forever have superior weights?

I would imagine it will be a fundamental breakthrough, not weights alone, that is going to usher in the next generation of AI. Perhaps China will in fact make that breakthrough. They certainly seem to have a lot of eyeballs in the field right now.
I think they are already massively winning on efficiency... which is about to matter a lot as the frontier models jack up their prices in order to some day see a profit (and no, Anthropic getting massively subsidized by Elon out of spite doesn't count for long term profits).
I also get the downvotes for the GPT thing, and agree with you about 5.5's quality, but TBH I don't think it's Anthropic marketing so much as three other things:

1. SamA and his company have a well-deserved bad reputation, and Anthropic got some early good PR for basically not being SamA.

2. Claude Code got early head space. Boris and crew basically "invented" this kind of agent, and so it has first-mover advantage despite its known reliability and cost issues.

3. Most people I talk to haven't even tried Codex for some reason.

Also it's uncool to complain about downvotes.

I downvoted you for your complaining about downvotes fwiw.

And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.

As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.

Well, Europe is famously a laggard when it comes to new tech. In parts of Switzerland, two horses were required to be mounted in front to carry cars up until 1925. The UK required a person to walk in front of a car and wave a red flag.
"…Anthropic Marketeer strike force…"

Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.

“Good will” is easier if OpenAI is your yardstick
As evil as Google is as a company these days [cough disclaimer, used to work here, so biased] I can't help but think that if Gemini didn't... suck, and if they had a coding model at the same quality as GPT 5.5 or Opus 4.8 they'd be completely cleaning up purely on the basis of relative reputations of the companies.

That Google is dropping the ball so badly, or just disinterested in the coding side of things... is either a sign of incompetence, or a lack of interest in losing money in that space. I wish I knew which.

Consider using descending score order (best on top).
then I'd have to scroll up over 500 lines after running it every time to see what I care about.

But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...

Add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order, so literally any string will do.

The original link has been updated accordingly with the new code.

Have it print paginated or just top 10?
only the small ones:

  $ ./art-analysis.sh | grep small
or maybe just the Qwen ones:

  $ ./art-analysis.sh | grep Qwen
only the ones from the past 30 days:

  $ ./art-analysis.sh | awk '$2 < 31'
I use it in pipes like this.
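
Along the same lines, a top-N view is one more pipe. This is a sketch with a hypothetical `top_n` helper, assuming the four-column layout shown in the thread. It sorts by score itself, so it does not depend on which order the script printed in.

```shell
#!/bin/sh
# top_n N: hypothetical helper. Reads "score age size name" rows on stdin and
# prints the N highest-scoring rows, best first.
top_n() {
  grep -E '^ *[0-9]' |          # keep data rows only (drops the header)
    LC_ALL=C sort -k1,1nr |     # numeric sort on the score column, descending
    head -n "$1"
}

# Example with three rows from the thread; prints the top two.
top_n 2 <<'EOF'
  score  age  size name
  50.7   1   large GLM-5.2 (max)
  59.1   55    -   GPT-5.5 (xhigh)
  56.7   20    -   Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
EOF
```

With the real script that would be `./art-analysis.sh | top_n 10`, or `./art-analysis.sh | awk '$2 < 31' | top_n 5` for the five best models from the past 30 days.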
Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.
Thanks for sharing. I'm curious: why didn't you sort with the score descending?
Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?
I do and that's why I pipe the output to `head -n 20` or use `LIMIT 20` in SQL.

That aside, this is a good script you're running. Thanks.

But maybe you decide you want to see more. It makes perfect sense for a cli tool to output the most interesting piece of info last: then you can decide on the fly whether you want to scroll up or not.
Not OP but if you run this from the CLI it does make the ordering make a little more sense
Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.

Set up a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.

Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.

Maybe your script could sort based on score.
Would be interesting to see where gpt 5.5 pro extended is.