Hacker News
by willseth 366 days ago
Same experience here. I even built a Gem with an elaborate prompt instructing it how to be concise, but it still gives annoyingly long-winded responses and frequently expands the scope of its answer far beyond the prompt.

I feel like this is part of the AI playbook now. Launch a really strong, capable model (with expensive inference), and once users think it’s SOTA, neuter it so it’s cheaper to run and most users won’t notice.

The same happened with GPT-3.5. It was so good early on and got worse as OpenAI began to cut costs. I feel like when GPT-4.1 was cloaked as Optimus on OpenRouter, it was really good, but once it launched, it also got worse.

That has been capitalism's playbook all along. It's just much faster here because it's just software. But they do it for everything, all the time.

I disagree with the comparison between LLM behavior and traditional software getting worse. When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals. Companies often don’t bother hiding it, since their users are typically locked into their ecosystem.

LLMs, on the other hand, operate under different incentives. It’s in a company’s best interest to initially release the strongest model, top the benchmarks, and then quietly degrade performance over time. Unlike traditional software, LLMs have low switching costs: users can easily jump to a better alternative. That makes it more tempting for companies to conceal model downgrades to prevent user churn.

> When regular software declines in quality, it’s usually noticeable through UI changes, release notes, or other signals.

Counterexample: 99% of average Joes have no idea how incredibly enshittified Google Maps has become, to name just one app. These companies intentionally boil the frog very slowly, and most people are incredibly bad at noticing gradual changes (see global warming).

Sure, they could know by comparing, but you could also know whether models are changing behind the scenes by having sets of evals.
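
For what it's worth, here is a minimal sketch of what "having sets of evals" could look like: a fixed list of prompts with simple checkers, a pass rate, and a comparison against a baseline recorded at launch. Everything here is an assumption for illustration: the prompts, the 5% tolerance, and the two stub functions standing in for a real model API are hypothetical, and real evals would need far more cases and repeated runs to separate a silent downgrade from sampling noise.

```python
from typing import Callable, List, Tuple

# Each eval case is a prompt plus a checker for the response (hypothetical examples).
EvalCase = Tuple[str, Callable[[str], bool]]

EVAL_SET: List[EvalCase] = [
    ("What is 17 * 23? Answer with the number only.", lambda r: "391" in r),
    ("Reverse the string 'abc'. Answer with the result only.", lambda r: "cba" in r),
    ("Name the capital of France in one word.", lambda r: "paris" in r.lower()),
]


def pass_rate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of eval cases whose response passes its checker."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def has_regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a drop in pass rate larger than the (assumed) tolerance."""
    return current < baseline - tolerance


# Stub models standing in for real API calls (assumptions, not a real client).
def stub_launch_model(prompt: str) -> str:
    answers = {"17 * 23": "391", "Reverse": "cba", "capital": "Paris"}
    for key, answer in answers.items():
        if key in prompt:
            return answer
    return ""


def stub_degraded_model(prompt: str) -> str:
    return "Paris" if "capital" in prompt else "I'm not sure."


baseline = pass_rate(stub_launch_model, EVAL_SET)   # recorded at launch
current = pass_rate(stub_degraded_model, EVAL_SET)  # re-run later on the same set
regressed = has_regressed(baseline, current)
```

The point is just that the same fixed set gets re-run on a schedule, so a change behind the same model name shows up as a number instead of a vibe.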

This is where switching costs matter. Take Google Maps: many people can’t switch to another app. In some areas, it’s the only app with accurate data, so Google can degrade the experience without losing users.

We can tell it’s getting worse because of UI changes, slower load times, and more ads. The signs are visible.

With LLMs, it’s different. There are no clear cues when quality drops. If responses seem off, users often blame their own prompts. That makes it easier for companies to quietly lower performance.

That said, many of us on HN use LLMs mainly for coding, so we can tell when things get worse.

Both cases involve the “boiling frog” effect, but with LLMs, users can easily jump to another pot. With traditional software, switching is much harder.

Do you mind explaining how you see this working as a nefarious plot? I don't see an upside in this case, so I'm going with the old "never ascribe to malice" etc.