by varun_chopra 366 days ago

At one point, when they made Gemini Pro free on AI Studio, Gemini was the model of choice for many people, I believe.

Somehow it's gotten worse since then, and I'm back to using Claude for serious work.

Gemini is like that guy who keeps talking but has no idea what he's actually talking about.

I still use Gemini for brainstorming, though I take its suggestions with several grains of salt. It's also useful for generating prompts that I can then refine and use with Claude.

12 comments

therealmarv 366 days ago

not according to Aider leaderboard https://aider.chat/docs/leaderboards/

I use only the APIs directly with Aider (so no experience with AI Studio).

My feeling with Claude is that it still performs well with weak prompts; the "taste" is maybe a little better when the prompter doesn't really know the direction.

When the direction is known I see Gemini 2.5 Pro (with thinking) on top of Claude with code which does not break. And with o4-mini and o3 I see more "smart" thinking (as if there is a little bit of brain inside these models) at the expense of producing unstable code (Gemini produces more stable code).

I see problems with Claude when complexity increases and I would put it behind Gemini and o3 in my personal ranking.

So far I had no reason to go back to Claude since o3-mini was released.

stavros 366 days ago

I just spent $35 for Opus to solve a problem with a hardware side-project (I'm turning an old rotary phone into a meeting handset so I can quit meetings by hanging up, if you must know). It didn't solve the problem, it churned and churned and spent a ton of money.

I was much more satisfied with o3 and Aider, I haven't tried them on this specific problem but I did quite a bit of work on the same project with them last night. I think I'm being a bit unfair, because what Claude got stuck on seems to be a hard problem, but I don't like how they'll happily consume all my money trying the same things over and over, and never say "yeah I give up".

antgiant 366 days ago

For basically that same price you could get one of these :-)

https://www.amazon.com/Cell2jack-Cellphone-Adapter-Receive-l...

stavros 366 days ago

Where's the fun in that?!

antgiant 366 days ago

Enjoy yourself! Don’t let me spoil your fun :-)

stavros 366 days ago

Oh I'm not! I'll post it here when I'm done, it's already hilarious.

sans_souse 366 days ago

wait, you're using a rotary phone ?

stavros 366 days ago

I want to!

alecco 366 days ago

Give them feedback.

stavros 366 days ago

Feedback on what?

CamperBob2 366 days ago

When I obtain results from one paid model that are significantly better than what I previously got from another paid model, I'll typically give a thumbs-down to the latter and point out in the comment that it was beaten by a competitor. Can't hurt.

stavros 366 days ago

Ah, this wasn't from the web interface, I was using Claude Code. I don't think it has a feedback mechanism.

macNchz 366 days ago

Using all of the popular coding models pretty extensively over the past year, I've been having great success with Gemini 2.5 Pro as far as getting working code the first time, instruction following around architectural decisions, and staying on-task. I use Aider and write mostly Python, JS, and shell scripts. I've spent hundreds of dollars on the Claude API over time but have switched almost entirely to Gemini. The API itself is also much more reliable.

My only complaint about 2.5 Pro is around the inane comments it leaves in the code (// Deleted varName here).

ZeWaka 366 days ago

If you use one of the AI static instructions methods (e.g., .github/copilot-instructions.md) and tell it to not leave the useless comments, that seems to solve the issue.

macNchz 366 days ago

I've been intending to try some side by side tests with and without a conventions file instructing it not to leave stupid comments—I'm curious to see if somehow they're providing value to the model, e.g. in multi-turn edits.

luckydata 366 days ago

it's easier to just make it do a code review with focus on removing unhelpful comments instead of asking it not to do it the first time. I do the cleanup after major rounds of work and that strategy seems to work best for me.

jjani 366 days ago

This was not my experience with the earlier preview (03), where its insistence on comment spam was too strong to overcome. Wonder if this adherence improved in the 05 or 06 updates.

sans_souse 366 days ago

can you elaborate on this?

dominicrose 365 days ago

I don't mind the comments, I read them while removing them. It's normal to have to adapt the output, change some variable names, refactor a bit. What's impressive is that the output code actually works (or almost). I didn't give it the hardest of problems to solve/code but certainly not easy ones.

macNchz 365 days ago

Yeah I've mostly just embraced having to remove them as part of a code review, helps focus the review process a bit, really.

avereveard 366 days ago

I'm using Pro for backend and Claude for UX work. Claude is so much more thoughtful about how users interact with software, and it can usually do a better job of replicating the mockup that the GPT-4o image generator produces, while not being overly fixated on the mockup design itself.

My complaint is that it catches Python exceptions and doesn't log them by default.
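To make the complaint concrete, here is a minimal sketch of the difference between swallowing an exception and logging it. The function names and the port-parsing task are hypothetical, picked only for illustration; the only real API used is the standard library's `logging` module.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port_silent(raw: str) -> int:
    """The pattern being complained about: the error is swallowed."""
    try:
        return int(raw)
    except Exception:
        return 0  # the caller never learns that parsing failed


def parse_port_logged(raw: str, default: int = 0) -> int:
    """Same fallback, but the failure is narrowed and recorded."""
    try:
        return int(raw)
    except ValueError:
        # logger.exception logs at ERROR level and attaches the traceback
        logger.exception("could not parse port from %r", raw)
        return default
```

Both versions return a fallback value, so callers behave the same; the second one just leaves a trace you can find later.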

miki123211 365 days ago

And the error handling. God, does it love to insert random try/except statements everywhere.

hirako2000 366 days ago

Your feelings of a little brain in there, and of stable code, are unfounded. All these models collapse pretty fast. If not due to context limits, then due to their inability to interpret problems.

An LLM is just statistical regressions with a plethora of engineering tricks, mostly NLP, to produce an illusion.

I don't mean it's useless. I mean comparing these ever-evolving models is like comparing escort staff in NYC vs. those in L.A.: hard to reach any conclusion. We are getting fooled.

On the price increase, it seems Google was aggressively looking for adoption; for a short time Gemini was the best value for money of all the LLMs out there. Adoption likely surged, and the scaling needs must be astronomical, costing Google billions to keep up. The price adjustment could've been expected before they announced it.

unshavedyak 366 days ago

Yea, i had similar experiences. At first it felt like it solved complex problems really well, but then i realized i was having trouble steering it for simple things. It was also very verbose.

Overall though my primary concern is the UX, and Claude Code is the UX of choice for me currently.

sagarpatil 366 days ago

Check out zen MCP server https://github.com/BeehiveInnovations/zen-mcp-server Lets you use Gemini and OpenAI models in Claude Code.

cap11235 365 days ago

Ooh this seems nice. Most similar solutions monkeypatch the npm package, which is a bit icky

willseth 366 days ago

Same experience here. I even built a Gem with an elaborate prompt instructing it how to be concise, but it still gives annoying long-winded responses and frequently expands the scope of its answer far beyond the prompt.

theturtletalks 366 days ago

I feel like this is part of the AI playbook now. Launch a really strong, capable model (with expensive inference) and once users think it’s SOTA, neuter it so it costs less to run and most users won’t notice.

The same happened with GPT-3.5. It was so good early on and got worse as OpenAI began to cut costs. I feel like when GPT-4.1 was cloaked as Optimus on Openrouter, it was really good, but once it launched, it also got worse.

carlos22 366 days ago

That has been capitalism's playbook all along. It's just much faster here because it's just software. But they do it for everything, all the time.

theturtletalks 366 days ago

I disagree with the comparison between LLM behavior and traditional software getting worse. When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals. Companies often don’t bother hiding it, since their users are typically locked into their ecosystem.

LLMs, on the other hand, operate under different incentives. It’s in a company’s best interest to initially release the strongest model, top the benchmarks, and then quietly degrade performance over time. Unlike traditional software, LLMs have low switching costs, users can easily jump to a better alternative. That makes it more tempting for companies to conceal model downgrades to prevent user churn.

jjani 366 days ago

> When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals.

Counterexample: 99% of average Joes have no idea how incredibly enshittified Google Maps has become, to just name one app. These companies intentionally boil the frog very slowly, and most people are incredibly bad at noticing gradual changes (see global warming).

Sure, they could know by comparing, but you could also know whether models are changing behind the scenes by having sets of evals.
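A rough sketch of what "having sets of evals" could look like: run the same fixed prompts against the model on different days and compare pass rates. Everything here is an assumption for illustration. The prompts, the substring check, the 0.05 tolerance, and the `model` callable (a stand-in for whatever API client you use) are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical fixed eval set: (prompt, substring the answer must contain).
EVAL_SET: List[Tuple[str, str]] = [
    ("What is 17 * 3? Reply with the number only.", "51"),
    ("Name the Python keyword that defines a function.", "def"),
    ("What is the capital of France?", "Paris"),
]


def pass_rate(model: Callable[[str], str],
              evals: List[Tuple[str, str]] = EVAL_SET) -> float:
    """Fraction of eval prompts whose response contains the expected text."""
    passed = sum(1 for prompt, expected in evals if expected in model(prompt))
    return passed / len(evals)


def regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a drop larger than `tolerance` between two runs of the same evals."""
    return (baseline - current) > tolerance
```

Store the pass rate from each run next to the date and the model name. A real eval set would need many more prompts and repeated runs, since sampling noise alone can move a small score.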

theturtletalks 366 days ago

This is where switching costs matter. Take Google Maps, many people can’t switch to another app. In some areas, it’s the only app with accurate data, so Google can degrade the experience without losing users.

We can tell it’s getting worse because of UI changes, slower load times, and more ads. The signs are visible.

With LLMs, it’s different. There are no clear cues when quality drops. If responses seem off, users often blame their own prompts. That makes it easier for companies to quietly lower performance.

That said, many of us on HN use LLMs mainly for coding, so we can tell when things get worse.

Both cases involve the “boiling frog” effect, but with LLMs, users can easily jump to another pot. With traditional software, switching is much harder.

andybak 366 days ago

Do you mind explaining how you see this working as a nefarious plot? I don't see an upside in this case so I'm going with the old "never ascribe to malice" etc

jasonjmcghee 366 days ago

I have no inside information but feels like they quantized it. I've seen patterns that I usually only see in quantized models like getting stuck repeating a single character indefinitely
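For anyone who wants to count how often the "stuck repeating a single character" symptom shows up in saved responses, a small heuristic like this would do. The function names and the threshold of 50 are arbitrary assumptions, and it only detects the symptom; it says nothing about whether quantization is the cause.

```python
import re


def longest_char_run(text: str) -> int:
    """Length of the longest run of one repeated character in `text`."""
    # (.)\1* matches a character followed by any number of copies of itself;
    # DOTALL lets a run of newlines count as well.
    runs = (len(m.group(0)) for m in re.finditer(r"(.)\1*", text, re.DOTALL))
    return max(runs, default=0)


def looks_stuck(text: str, threshold: int = 50) -> bool:
    """Heuristic: True if any single character repeats `threshold`+ times in a row."""
    return longest_char_run(text) >= threshold
```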

noisy_boy 366 days ago

They should just roll back to the preview versions. Those were so much more even keeled and actually did some useful pushback instead of this cheerleader-on-steroids version they GA'd.

samvher 365 days ago

Yes I was very surprised after the whole "scandal" around ChatGPT becoming too sycophantic that there was this massive change in tone from the last preview model (05-06) to the 06-05/GA model. The tone is really off-putting, I really liked how the preview versions felt like intelligent conversation partners and recognize what you're saying about useful pushback - it was my favorite set of models (the few preview iterations before this one) and I'm sad to see them disappearing.

Many people on the Google AI Developer forums have also noted either bugs or just performance regression in the final model.

k8sToGo 366 days ago

But they claim it's the same model and version?

noisy_boy 366 days ago

I don't know but it sure doesn't feel the same. I have been using Gemini 2.5 pro (preview and now GA) for a while. The difference in tone is palpable. I also noticed that the preview took longer time and the GA is faster so it could be quantization.

Maybe a bunch of people with authority to decide thought that it was too slow/expensive/boring and screwed up a nice thing.

huevosabio 366 days ago

They made it talk like buzzfeed articles for every single interaction. It's absolutely horrible

FirmwareBurner 366 days ago

I now find Gemini terrible for coding. I gave it my code blocks and told it what to change, and it added tonnes and tonnes of needless extra code plus endless comments. It turned tight code into a papyrus.

ChatGPT is better but tends to be too agreeable, never trying to disagree with what you say even if it's stupid so you end up shooting yourself in the foot.

Claude seems like the best compromise.

Just my two kopecks.

UncleOxidant 366 days ago

Used to be able to use Gemini Pro free in cline. Now the API limits are so low that you immediately get messages about needing to top up your wallet and API queries just don't go through. Back to using DeepSeek R1 free in cline (though even that eventually stops after a few hours and you have to wait until the next day for it to work again). Starting to look like I need to set up a local LLM for coding, which means it's time to seriously upgrade my PC (well, it's been about 10 years so it was getting to be time anyway).

Workaccount2 366 days ago

By the time you break even on whatever you spend on a decent LLM-capable build, your hardware will be too far behind to run whatever is best locally by then. It's something that feels cheaper but, with the pace of things, probably doesn't make sense unless you are churning through an insane amount of tokens. Never mind that local models running on 24 or 48GB are maybe around flash-lite in ability while being slower than SOTA models.

Local models are mostly for hobby and privacy, not really efficiency.
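The break-even argument is easy to put into numbers. This is a back-of-the-envelope sketch only: the function name is made up, and the dollar figures in the usage note are hypothetical, not real prices.

```python
def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     monthly_power_cost: float = 0.0) -> float:
    """Months until a local build costs less than continuing to pay for an API.

    Returns float('inf') if running locally never saves money.
    """
    monthly_saving = monthly_api_spend - monthly_power_cost
    if monthly_saving <= 0:
        return float("inf")
    return hardware_cost / monthly_saving
```

With a hypothetical $3,000 build, $50/month of API usage, and $10/month of electricity, `breakeven_months(3000, 50, 10)` gives 75 months, a bit over six years. That ignores the quality and speed gap mentioned above, which only makes the local option look worse.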

chrismustcode 366 days ago

When I ask it to do something in Cursor, it goes full Sherlock, thinking about every possible outcome.

Claude 4 Sonnet with thinking just has a bit of a think and then does it.

DangitBobby 366 days ago

Same for me. I've been using Gemini 2.5 Pro for the past week or so because people said Gemini is the best for coding! That's not at all my experience with Gemini 2.5 Pro: on top of being slow and flaky, the responses are kind of bad. Claude Sonnet 4 is much better IMO.

dr_kiszonka 366 days ago

They nerfed Pro 2.5 significantly in the last few months. Early this year, I had genuinely insightful conversations with Gemini 2.5 Pro. Now they are mostly frustrating.

I also have a personal conspiracy theory, i.e., that once a user exceeds a certain use threshold of 2.5 Pro in the Google Gemini app, they start serving a quantized version. Of course, I have no proof, but it certainly feels that way.

conradkay 366 days ago

Maybe they've been focusing so much on improving coding performance with RL for the new versions/previews that other areas degraded in performance

dr_kiszonka 366 days ago

I think you are right and this is probably the case.

Although, given that I rapidly went from +4 to 0 karma, a few other comments in this topic are grey, and at least one is missing, I am getting suspicious. (Or maybe it is just lunch time in MTV.)

SirensOfTitan 365 days ago

There was a significant nerf of Gemini 3-25 a little while ago, so much so that I detected it without knowing there was even a new release.

Totally convinced they quantized the model quietly and improved on the coding benchmark to hide that fact.

I’m frankly quite tired of LLM providers changing the model I’m paying for access to behind the scenes, often without informing me, and in Gemini’s case on the API too—at least last time I checked they updated the 3-25 checkpoint to the May update.

cma 365 days ago

One of the early updates improved agentic coding scores while lowering other general benchmark scores, which may have impacted those kinds of conversations.

esafak 366 days ago

I wonder how smart they are about quantizing. Do they look at feedback to decide which users won't mind?

r0fl 365 days ago

The context window on AI Studio feels endless.

All the other AIs seem to give me errors when working with large bodies of code.