by smoe
38 days ago
Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time: sync vs. async, typed vs. untyped, scientific Python looking very different from web application code, some people really wishing it were an FP language, and others doing the clean-architecture OOP onion soup. It has gotten so fragmented. Recently, I had a more pleasant experience using LLMs with Go. It reminds me a bit of Python 2.x, when the community seemed, in my view, more focused on embracing a stupid simple language, with everyone trying to write roughly similar "Pythonic" code.
If there’s one language that is the prime example of this fragmentation, it’s C++, and according to this benchmark it ranks incredibly high.
I’m also thoroughly confused about why Kimi 2.6 scores 83% while Opus 4.7 scores 67% for C++, and GPT5.5 isn’t even in the top 10.
Gemma 4 31B scores a 100% success rate for Python (!!) while Opus 4.6 scores only 65%.
This benchmark really seems to be all over the place and doesn’t make sense.