
Yes, there are other tools, but anecdotally, and from what I hear from my peers, Copilot gives equally mid results. Copilot just has better VS Code integration, which makes it a much nicer product to use.

I haven’t tried Codestral yet, tbf.

Also, LLM leaderboards are pretty useless. They’re either contrived tests or synthetic benchmarks which don’t reflect reality. We don’t have a good evaluation framework for LLMs yet.

The one you posted, for example, runs models through 113 small, Python-only coding exercises from the Exercism GitHub repo. Those aren’t representative of the tasks most SWEs deal with, AND it’s very likely that those models were “contaminated” by having already been trained on those exact exercises (since they are open source).
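
For what it’s worth, one rough way people probe for this kind of contamination is n-gram overlap between the benchmark text and a sample of training data. Below is a minimal sketch, assuming you actually have some of the training corpus to compare against (for most commercial models you don’t). The function names and the 8-token window are my own illustrative choices, not anything that leaderboard uses.

```python
def ngrams(text, n=8):
    """Return the set of n-token windows in text (whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, corpus_text, n=8):
    """Fraction of the benchmark's n-grams that also appear in the corpus.

    A value near 1.0 suggests the exercise text shows up verbatim in the
    corpus; a value near 0.0 suggests it does not. Hypothetical helper,
    for illustration only.
    """
    bench = ngrams(benchmark_text, n)
    if not bench:
        # Benchmark text is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)
```

A high ratio doesn’t prove the model memorized the answer, and a low one doesn’t rule out paraphrased copies, so treat it as a smoke test at best.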

That leaderboard is little better than marketing. Which makes sense since it’s from an AI code assistant company.

EDIT: and even on this extremely minor eval, the best any LLM could do was ~78%. Not exactly what I’d call “good”.