by stingraycharles 35 days ago
> Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time

If there’s one language that is the prime example of this, it’s C++, and according to this benchmark it ranks incredibly high.

I’m also thoroughly confused why Kimi 2.6 scores 83% for C++ while Opus 4.7 scores 67%, and GPT5.5 isn’t even in the top 10.

Gemma 4 31B scores a 100% success rate for Python (!!) while Opus 4.6 scores only 65%.

This benchmark really seems to be all over the place and doesn’t make sense.

1 comment

The more filters you apply (a single model and a single language, especially if you also filter by pipeline, such as agentic vs. one-shot), the fewer samples remain in each slice, so the variance goes up. This is a known limitation that is inevitable with any finite budget. It is also why we are selective about adding more languages: each one dilutes the number of samples we can run per language per model. The aggregated statistics hold up well, though, and are very consistent in our testing.
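
To make the sample-size point concrete, here is a rough sketch of how wide a 95% Wilson score interval is for a pass rate measured on a small slice. The sample counts are hypothetical (the thread does not give the real per-language, per-model counts), chosen only so that the rates land near the 83% and 67% quoted above.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical slices: 12 tasks each, not the benchmark's actual counts.
model_a = wilson_interval(10, 12)  # about an 83% pass rate
model_b = wilson_interval(8, 12)   # about a 67% pass rate

# The intervals overlap if each lower bound is below the other's upper bound.
overlap = model_a[0] < model_b[1] and model_b[0] < model_a[1]

print(f"A: {model_a[0]:.2f} to {model_a[1]:.2f}")
print(f"B: {model_b[0]:.2f} to {model_b[1]:.2f}")
print("intervals overlap:", overlap)
```

Under these assumed counts the two intervals overlap substantially, so a gap like 83% vs 67% on a thin slice would not by itself say which model is better. With the same pass rates on a much larger slice the intervals tighten and may separate.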
I just applied a single filter, programming language.