| HN Mirror



	by smoe 38 days ago
	Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time: sync vs. async, typed vs. untyped, scientific Python looking very different from web application code, some people really wishing it were an FP language, and others doing the clean-architecture OOP onion soup. It has gotten so fragmented. Recently, I had a more pleasant experience using LLMs with Go. It reminds me a bit of Python 2.x, when the community seemed, in my view, more focused on embracing a stupid simple language, with everyone trying to write roughly similar "Pythonic" code.


stingraycharles 38 days ago

> Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time

If there’s one language that is the prime example of this, it’s C++, and according to this benchmark it ranks incredibly high.

I’m also thoroughly confused why Kimi 2.6 scores 83% while Opus 4.7 scores 67% for C++, and GPT5.5 isn’t even in the top 10.

Gemma 4 31B scores a 100% success rate for Python (!!) while Opus 4.6 scores only 65%.

This benchmark really seems to be all over the place and doesn’t make sense.


gertlabs 38 days ago

The more filters you apply (single model and single language, especially if you also filter by pipeline like agentic vs one-shot), the fewer samples remain, so the variance goes up. This is a known limitation, inevitable with any finite budget, and it is why we are selective about adding more languages: each one dilutes the number of samples we can run per language per model. But the aggregated statistics hold up well and are very consistent in our testing.
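
To put rough numbers on the sample-size point above, here is a minimal sketch in Python that computes a 95% Wilson score interval for a success rate. The sample counts are hypothetical, chosen only for illustration; the benchmark's actual per-filter sample sizes are not stated in this thread, and this is not necessarily how the benchmark computes its own statistics.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical (successes, trials) pairs: NOT taken from the benchmark.
for successes, trials in [(6, 6), (20, 30), (650, 1000)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: observed {successes / trials:.0%}, "
          f"95% interval {low:.0%} to {high:.0%}")
```

With only a handful of runs, even a perfect observed score is compatible with a much lower true success rate, while a large aggregated sample gives a narrow interval. Whether that fully explains the per-language numbers depends on the real sample counts, which only the benchmark authors can supply.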


stingraycharles 38 days ago

I just applied a single filter, programming language.
