Hacker News
by therealmarv 366 days ago
Not according to the Aider leaderboard: https://aider.chat/docs/leaderboards/

I use only the APIs directly with Aider (so no experience with AI Studio).

My feeling with Claude is that it still performs well with weak prompts; the "taste" is maybe a little better when the direction is kinda unknown to the prompter.

When the direction is known I see Gemini 2.5 Pro (with thinking) on top of Claude with code which does not break. And with o4-mini and o3 I see more "smart" thinking (as if there is a little bit of brain inside these models) at the expense of producing unstable code (Gemini produces more stable code).

I see problems with Claude when complexity increases and I would put it behind Gemini and o3 in my personal ranking.

So far I have had no reason to go back to Claude since o3-mini was released.


I just spent $35 on Opus to solve a problem with a hardware side-project (I'm turning an old rotary phone into a meeting handset so I can quit meetings by hanging up, if you must know). It didn't solve the problem; it churned and churned and spent a ton of money.

I was much more satisfied with o3 and Aider. I haven't tried them on this specific problem, but I did quite a bit of work on the same project with them last night. I think I'm being a bit unfair, because what Claude got stuck on seems to be a hard problem, but I don't like how it'll happily consume all my money trying the same things over and over, and never say "yeah I give up".

For basically that same price you could get one of these :-)

https://www.amazon.com/Cell2jack-Cellphone-Adapter-Receive-l...

Where's the fun in that?!
Enjoy yourself! Don’t let me spoil your fun :-)
Oh I'm not! I'll post it here when I'm done, it's already hilarious.
Wait, you're using a rotary phone?
I want to!
Give them feedback.
Feedback on what?
When I obtain results from one paid model that are significantly better than what I previously got from another paid model, I'll typically give a thumbs-down to the latter and point out in the comment that it was beaten by a competitor. Can't hurt.
Ah, this wasn't from the web interface, I was using Claude Code. I don't think it has a feedback mechanism.
Using all of the popular coding models pretty extensively over the past year, I've been having great success with Gemini 2.5 Pro as far as getting working code the first time, instruction following around architectural decisions, and staying on-task. I use Aider and write mostly Python, JS, and shell scripts. I've spent hundreds of dollars on the Claude API over time but have switched almost entirely to Gemini. The API itself is also much more reliable.

My only complaint about 2.5 Pro is around the inane comments it leaves in the code (// Deleted varName here).

If you use one of the AI static instructions methods (e.g., .github/copilot-instructions.md) and tell it to not leave the useless comments, that seems to solve the issue.
I've been intending to try some side by side tests with and without a conventions file instructing it not to leave stupid comments—I'm curious to see if somehow they're providing value to the model, e.g. in multi-turn edits.
it's easier to just make it do a code review with focus on removing unhelpful comments instead of asking it not to do it the first time. I do the cleanup after major rounds of work and that strategy seems to work best for me.
This was not my experience with the earlier preview (03), where its insistence on comment spam was too strong to overcome. Wonder if this adherence improved in the 05 or 06 updates.
can you elaborate on this?
I don't mind the comments, I read them while removing them. It's normal to have to adapt the output, change some variable names, refactor a bit. What's impressive is that the output code actually works (or almost). I didn't give it the hardest of problems to solve/code but certainly not easy ones.
Yeah I've mostly just embraced having to remove them as part of a code review, helps focus the review process a bit, really.
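For the cleanup pass people describe above, here is a rough sketch of how you might flag the "narration" comments before reviewing them by hand. It is only a heuristic: the keyword list and the `flag_narration_comments` name are made up for illustration, it only looks at whole-line `//` and `#` comments, and it will miss plenty and occasionally flag a legitimate comment.

```python
import re

# Assumption: "narration" comments start with a verb describing the edit itself
# (e.g. "// Deleted varName here"). The keyword list is illustrative, not exhaustive.
NARRATION = re.compile(
    r"^\s*(//|#)\s*(deleted|removed|added|changed|updated|moved|renamed)\b",
    re.IGNORECASE,
)


def flag_narration_comments(source: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for comment-only lines that look like edit narration."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if NARRATION.match(line)
    ]


sample = (
    "let x = 1;\n"
    "// Deleted varName here\n"
    "# Removed old helper\n"
    "// Cache the result to avoid a second query\n"
)
for number, text in flag_narration_comments(sample):
    print(f"{number}: {text}")
```

On the sample it flags lines 2 and 3 and leaves the explanatory comment on line 4 alone. It only reports; deleting is still a human (or second-pass model) decision.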
I'm using Pro for backend and Claude for UX work. Claude is so much more thoughtful about how users interact with software, and it can usually do a better job of replicating the mockup that the GPT-4o image generator produces, while not being overly fixated on the mockup design itself.

My complaint is that it catches Python exceptions and doesn't log them by default.

And the error handling. God, does it love to insert random try/except statements everywhere.
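To make the complaint concrete, here is a minimal sketch of the pattern next to what I'd rather see. The function names and the port-parsing scenario are invented for illustration; this is not output from any particular model.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port_silent(raw: str) -> int:
    # The pattern being complained about: a broad except that swallows the
    # failure and returns a default, leaving no trace of what went wrong.
    try:
        return int(raw)
    except Exception:
        return 0


def parse_port_logged(raw: str) -> int:
    # Narrower alternative: catch only the expected error, log it with the
    # traceback via logger.exception, then re-raise so the caller decides.
    try:
        return int(raw)
    except ValueError:
        logger.exception("could not parse port %r", raw)
        raise
```

With the first version, `parse_port_silent("abc")` quietly returns 0 and the bug surfaces somewhere else later. With the second, the bad input is logged and the `ValueError` still reaches the caller.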
Your feelings of a little brain in there, and of stable code, are unfounded. All these models collapse pretty fast. If not due to the context limit, then due to their inability to interpret problems.

An LLM is just statistical regression with a plethora of engineering tricks, mostly NLP, to produce an illusion.

I don't mean it's useless. I mean comparing these ever-evolving models is like comparing escort staff in NYC vs. those in L.A.: hard to reach any conclusion. We are getting fooled.

On the price increase, it seems Google was aggressively looking for adoption. For a short time Gemini was the best value for money of all the LLMs out there. Adoption likely surged, and the scaling needs must be astronomical, costing Google billions to keep up. The price adjustment could have been expected before they announced it.