Hacker News
by varun_chopra 366 days ago
At one point, when they made Gemini Pro free on AI Studio, Gemini was the model of choice for many people, I believe.

Somehow it's gotten worse since then, and I'm back to using Claude for serious work.

Gemini is like that guy who keeps talking but has no idea what he's actually talking about.

I still use Gemini for brainstorming, though I take its suggestions with several grains of salt. It's also useful for generating prompts that I can then refine and use with Claude.

12 comments

not according to Aider leaderboard https://aider.chat/docs/leaderboards/

I use only the APIs directly with Aider (so no experience with AI Studio).

My feeling with Claude is that they still perform well with weak prompts; the "taste" is maybe a little better when the direction is kinda unknown by the prompter.

When the direction is known I see Gemini 2.5 Pro (with thinking) on top of Claude with code which does not break. And with o4-mini and o3 I see more "smart" thinking (as if there is a little bit of brain inside these models) at the expense of producing unstable code (Gemini produces more stable code).

I see problems with Claude when complexity increases and I would put it behind Gemini and o3 in my personal ranking.

So far I had no reason to go back to Claude since o3-mini was released.

I just spent $35 for Opus to solve a problem with a hardware side-project (I'm turning an old rotary phone into a meeting handset so I can quit meetings by hanging up, if you must know). It didn't solve the problem, it churned and churned and spent a ton of money.

I was much more satisfied with o3 and Aider, I haven't tried them on this specific problem but I did quite a bit of work on the same project with them last night. I think I'm being a bit unfair, because what Claude got stuck on seems to be a hard problem, but I don't like how they'll happily consume all my money trying the same things over and over, and never say "yeah I give up".

For basically that same price you could get one of these :-)

https://www.amazon.com/Cell2jack-Cellphone-Adapter-Receive-l...

Where's the fun in that?!
Enjoy yourself! Don’t let me spoil your fun :-)
Oh I'm not! I'll post it here when I'm done, it's already hilarious.
wait, you're using a rotary phone ?
I want to!
Give them feedback.
Feedback on what?
When I obtain results from one paid model that are significantly better than what I previously got from another paid model, I'll typically give a thumbs-down to the latter and point out in the comment that it was beaten by a competitor. Can't hurt.
Ah, this wasn't from the web interface, I was using Claude Code. I don't think it has a feedback mechanism.
Using all of the popular coding models pretty extensively over the past year, I've been having great success with Gemini 2.5 Pro as far as getting working code the first time, instruction following around architectural decisions, and staying on-task. I use Aider and write mostly Python, JS, and shell scripts. I've spent hundreds of dollars on the Claude API over time but have switched almost entirely to Gemini. The API itself is also much more reliable.

My only complaint about 2.5 Pro is around the inane comments it leaves in the code (// Deleted varName here).

If you use one of the AI static instructions methods (e.g., .github/copilot-instructions.md) and tell it to not leave the useless comments, that seems to solve the issue.
I've been intending to try some side by side tests with and without a conventions file instructing it not to leave stupid comments—I'm curious to see if somehow they're providing value to the model, e.g. in multi-turn edits.
it's easier to just make it do a code review with focus on removing unhelpful comments instead of asking it not to do it the first time. I do the cleanup after major rounds of work and that strategy seems to work best for me.
This was not my experience with the earlier preview (03), where its insistence on comment spam was too strong to overcome. Wonder if this adherence improved in the 05 or 06 updates.
can you elaborate on this?
I don't mind the comments, I read them while removing them. It's normal to have to adapt the output, change some variable names, refactor a bit. What's impressive is that the output code actually works (or almost). I didn't give it the hardest of problems to solve/code but certainly not easy ones.
Yeah I've mostly just embraced having to remove them as part of a code review, helps focus the review process a bit, really.
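For the cleanup-pass approach described above, a minimal Python sketch of a pre-review filter that flags comments which only narrate an edit (like `// Deleted varName here`). The regex, the verb list, and the name `find_change_notes` are assumptions for illustration, not something anyone in the thread uses; it is a rough heuristic and will miss or over-match some cases.

```python
import re

# Heuristic (an assumption, not a rule from the thread): a comment that merely
# narrates an edit tends to start with a past-tense change verb.
CHANGE_NOTE = re.compile(
    r"(?://|#)\s*(?:deleted|removed|added|changed|updated|moved)\b",
    re.IGNORECASE,
)


def find_change_notes(source: str):
    """Return (line_number, line) pairs for comments that look like edit narration."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if CHANGE_NOTE.search(line):
            hits.append((number, line.strip()))
    return hits
```

Running it over a diff before review gives a short list of lines to delete by hand, which matches the "remove them as part of code review" workflow.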
I'm using pro for backend and claude for ux work, claude is so much more thoughtful about how users interact with software and can usually better replicate the mockup that the gpt4o image generator produces, while not being overly fixated on the mockup design itself.

My complaint is that it catches Python exceptions and doesn't log them by default.

And the error handling. God, does it love to insert random try/except statements everywhere.
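To make the complaint concrete, a minimal sketch of the two patterns. The function names and the port-parsing task are made up for illustration; only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger("example")


def parse_port_swallowing(value):
    """The pattern complained about: the error vanishes without a trace."""
    try:
        return int(value)
    except Exception:
        return None


def parse_port_logged(value):
    """Catch only the expected error and log it with its traceback."""
    try:
        return int(value)
    except ValueError:
        # logger.exception records the message at ERROR level plus the traceback.
        logger.exception("could not parse port %r", value)
        return None
```

Both return `None` on bad input, but only the second leaves something to debug with, and it lets unexpected errors (for example a `TypeError` from passing `None`) propagate instead of hiding them.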
Your feelings of a little brain in there, and of stable code, are unfounded. All these models collapse pretty fast. If not due to context limits, then due to their inability to interpret problems.

An LLM is just statistical regression with a plethora of engineering tricks, mostly NLP, to produce an illusion.

I don't mean it's useless. I mean comparing these ever-evolving models is like comparing escort staff in NYC vs. those in L.A.: hard to reach any conclusion. We are getting fooled.

On the price increase, it seems Google was aggressively looking for adoption; Gemini was for a short time the best value for money of all the LLMs out there. Adoption likely surged, and the scaling needs must be astronomical, costing Google billions to keep up. The price adjustment could've been expected before they announced it.

Yea, i had similar experiences. At first it felt like it solved complex problems really well, but then i realized i was having trouble steering it for simple things. It was also very verbose.

Overall though my primary concern is the UX, and Claude Code is the UX of choice for me currently.

Check out zen MCP server https://github.com/BeehiveInnovations/zen-mcp-server Lets you use Gemini and OpenAI models in Claude Code.
Ooh this seems nice. Most similar solutions monkeypatch the npm package, which is a bit icky
Same experience here. I even built a Gem with an elaborate prompt instructing it how to be concise, but it still gives annoying long-winded responses and frequently expands the scope of its answer far beyond the prompt.
I feel like this is part of the AI playbook now. Launch a really strong, capable model (with expensive inference) and once users think it’s SOTA, neuter it so the cost is cheaper and most users won’t notice.

The same happened with GPT-3.5. It was so good early on and got worse as OpenAI began to cut costs. I feel like when GPT-4.1 was cloaked as Optimus on Openrouter, it was really good, but once it launched, it also got worse.

That has been capitalism's playbook all along. It's just much faster because it's just software. But they do it for everything all the time.
I disagree with the comparison between LLM behavior and traditional software getting worse. When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals. Companies often don’t bother hiding it, since their users are typically locked into their ecosystem.

LLMs, on the other hand, operate under different incentives. It’s in a company’s best interest to initially release the strongest model, top the benchmarks, and then quietly degrade performance over time. Unlike traditional software, LLMs have low switching costs: users can easily jump to a better alternative. That makes it more tempting for companies to conceal model downgrades to prevent user churn.

> When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals.

Counterexample: 99% of average Joes have no idea how incredibly enshittified Google Maps has become, to just name one app. These companies intentionally boil the frog very slowly, and most people are incredibly bad at noticing gradual changes (see global warming).

Sure, they could know by comparing, but you could also know whether models are changing behind the scenes by having sets of evals.
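A minimal sketch of what "having sets of evals" could look like: a fixed list of prompts with pass/fail checks, rerun periodically and compared against a stored baseline. Everything here is hypothetical: `ask_model` stands in for whatever client call you use, and the two cases and the 0.05 tolerance are placeholders.

```python
from typing import Callable, List, Tuple

# Each case is (prompt, checker). These prompts and checkers are made-up examples.
EvalCase = Tuple[str, Callable[[str], bool]]

CASES: List[EvalCase] = [
    ("What is 17 * 3? Reply with the number only.", lambda out: out.strip() == "51"),
    ("Reply with the word OK and nothing else.", lambda out: out.strip() == "OK"),
]


def pass_rate(ask_model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run every case through ask_model and return the fraction that pass."""
    passed = sum(1 for prompt, check in cases if check(ask_model(prompt)))
    return passed / len(cases)


def regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a drop larger than tolerance relative to a stored baseline."""
    return (baseline - current) > tolerance
```

In practice you would want many more cases and several samples per prompt, since model output varies from run to run and a single failure proves little.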

This is where switching costs matter. Take Google Maps, many people can’t switch to another app. In some areas, it’s the only app with accurate data, so Google can degrade the experience without losing users.

We can tell it’s getting worse because of UI changes, slower load times, and more ads. The signs are visible.

With LLMs, it’s different. There are no clear cues when quality drops. If responses seem off, users often blame their own prompts. That makes it easier for companies to quietly lower performance.

That said, many of us on HN use LLMs mainly for coding, so we can tell when things get worse.

Both cases involve the “boiling frog” effect, but with LLMs, users can easily jump to another pot. With traditional software, switching is much harder.

Do you mind explaining how you see this working as a nefarious plot? I don't see an upside in this case so I'm going with the old "never ascribe to malice" etc
I have no inside information but feels like they quantized it. I've seen patterns that I usually only see in quantized models like getting stuck repeating a single character indefinitely
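The "stuck repeating a single character" symptom is easy to check for mechanically. A minimal sketch, where the function names and the threshold of 50 are arbitrary assumptions; it detects the symptom only and says nothing about whether quantization is the cause.

```python
def longest_char_run(text: str) -> int:
    """Length of the longest run of one repeated character in text."""
    longest = 0
    current = 0
    previous = None
    for ch in text:
        current = current + 1 if ch == previous else 1
        previous = ch
        longest = max(longest, current)
    return longest


def looks_stuck(text: str, threshold: int = 50) -> bool:
    """True if some character repeats at least `threshold` times in a row."""
    return longest_char_run(text) >= threshold
```

Logging this flag for each response would at least turn "I've seen patterns" into a count that can be compared across model versions.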
They should just roll back to the preview versions. Those were so much more even keeled and actually did some useful pushback instead of this cheerleader-on-steroids version they GA'd.
Yes I was very surprised after the whole "scandal" around ChatGPT becoming too sycophantic that there was this massive change in tone from the last preview model (05-06) to the 06-05/GA model. The tone is really off-putting, I really liked how the preview versions felt like intelligent conversation partners and recognize what you're saying about useful pushback - it was my favorite set of models (the few preview iterations before this one) and I'm sad to see them disappearing.

Many people on the Google AI Developer forums have also noted either bugs or just performance regression in the final model.

But they claim it's the same model and version?
I don't know but it sure doesn't feel the same. I have been using Gemini 2.5 pro (preview and now GA) for a while. The difference in tone is palpable. I also noticed that the preview took longer time and the GA is faster so it could be quantization.

Maybe a bunch of people with authority to decide thought that it was too slow/expensive/boring and screwed up a nice thing.

They made it talk like buzzfeed articles for every single interaction. It's absolutely horrible
I found Gemini now terrible for coding. I gave it my code blocks and told it what to change and it added tonnes and tonnes of needless extra code plus endless comments. It turned tight code into a Papyrus.

ChatGPT is better but tends to be too agreeable, never trying to disagree with what you say even if it's stupid so you end up shooting yourself in the foot.

Claude seems like the best compromise.

Just my two kopecks.

Used to be able to use Gemini Pro free in cline. Now the API limits are so low that you immediately get messages about needing to top up your wallet and API queries just don't go through. Back to using DeepSeek R1 free in cline (though even that eventually stops after a few hours and you have to wait until the next day for it to work again). Starting to look like I need to set up a local LLM for coding - which means it's time to seriously upgrade my PC (well, it's been about 10 years so it was getting to be time anyway)
By the time you break even on whatever you spend on a decent LLM-capable build, your hardware will be too far behind to run whatever is best locally then. It's something that feels cheaper but with the pace of things, unless you are churning an insane amount of tokens, probably doesn't make sense. Never mind that local models running on 24 or 48GB are maybe around flash-lite in ability while being slower than SOTA models.

Local models are mostly for hobby and privacy, not really efficiency.
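The break-even argument is simple arithmetic. A minimal sketch; the function name is hypothetical and the figures in the usage note are invented, so plug in your own.

```python
def break_even_months(
    hardware_cost: float,
    monthly_api_spend: float,
    monthly_power_cost: float = 0.0,
) -> float:
    """Months until a local build pays for itself versus paying for an API.

    Returns infinity when running locally saves nothing per month.
    """
    savings = monthly_api_spend - monthly_power_cost
    if savings <= 0:
        return float("inf")
    return hardware_cost / savings
```

With an invented $3000 build and $50 a month of API spend, `break_even_months(3000, 50)` gives 60 months, which is the point being made: five years is a long time at the current pace of model and hardware releases.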

When I ask it to do something in cursor it goes full sherlock thinking about every possible outcome.

Claude 4 Sonnet with thinking just has a bit of a think, then does it.

Same for me. I've been using Gemini 2.5 Pro for the past week or so because people said Gemini is the best for coding! Not at all my experience with Gemini 2.5 Pro: on top of being slow and flaky, the responses are kind of bad. Claude Sonnet 4 is much better IMO.
They nerfed Pro 2.5 significantly in the last few months. Early this year, I had genuinely insightful conversations with Gemini 2.5 Pro. Now they are mostly frustrating.

I also have a personal conspiracy theory, i.e., that once a user exceeds a certain use threshold of 2.5 Pro in the Google Gemini app, they start serving a quantized version. Of course, I have no proof, but it certainly feels that way.

Maybe they've been focusing so much on improving coding performance with RL for the new versions/previews that other areas degraded in performance
I think you are right and this is probably the case.

Although, given that I rapidly went from +4 to 0 karma, a few other comments in this topic are grey, and at least one is missing, I am getting suspicious. (Or maybe it is just lunch time in MTV.)

There was a significant nerf of Gemini 3-25 a little while ago, so much so that I detected it without knowing there was even a new release.

Totally convinced they quantized the model quietly and improved on the coding benchmark to hide that fact.

I’m frankly quite tired of LLM providers changing the model I’m paying for access to behind the scenes, often without informing me, and in Gemini’s case on the API too—at least last time I checked they updated the 3-25 checkpoint to the May update.

One of the early updates improved agentic coding scores while lowering other general benchmark scores, which may have impacted those kind of conversations.
I wonder how smart they are about quantizing. Do they look at feedback to decide which users won't mind?
The context window on ai studio feels endless.

All other AIs seem to give me errors when working with large bodies of code.