by dartos 703 days ago
> In the coming years, we’re likely to see coding assistants like Copilot become more and more autonomous. Instead of completing your code, the new systems will write new functions given a specification. Instead of writing new functions, they will make small changes to a handful of files. Instead of making small changes, they’ll build new functionality with increasing complexity. They’ll do complex refactors, then start codebases from scratch, then manage projects from scratch.

I see literally 0 evidence of this sort of trend.

Copilot is still BY FAR the best AI coding tool, and it’s just ok and hasn’t improved much since release IMO.

We don’t have infinite code to train models on. GPT has almost certainly been trained on nearly everything on GitHub, and probably GitLab too.

I’d be very surprised to see this level of code assistants with our current AI methods.


Is it? There's also Cursor.sh, Mentat, Aider, rift, and Continue.dev.

Aider ranks the LLM engines, with claude-3.5-sonnet coming out on top, but doesn't test against Copilot

https://aider.chat/docs/leaderboards/

Yes, there are other tools, but anecdotally, and from what I hear from my peers, Copilot provides equally mid results. Copilot just has better VS Code integration, making it a much nicer product to use.

I haven’t tried codestral yet tbf.

Also, LLM leaderboards are pretty useless. They’re either contrived tests or synthetic benchmarks which don’t reflect reality. We don’t have a good evaluation framework for LLMs yet.

The one you posted, for example, has models run through 113 small, Python-specific coding exercises from the Exercism GitHub repo. Those are not representative of the tasks most SWEs deal with, AND it’s very likely that those models were “contaminated” by having already been trained on those exact exercises (since they are open source).

That leaderboard is little better than marketing. Which makes sense since it’s from an AI code assistant company.

EDIT: and even on this extremely minor eval, the best the LLMs could do was ~78%. Not exactly what I’d call “good”.