HN Mirror


	by OutOfHere 131 days ago
	OpenAI in my estimation has the habit of dropping a model's quality after its introduction. I definitely recall the web ChatGPT 5.2 being a lot better when it was introduced. A week or two later, its quality suddenly dropped. The initial high looked to be to throw off journalists and benchmarks. As such, nothing that OpenAI says in terms of model speed can be trusted. All they have to do is lower the reasoning effort on average, and boom, it becomes 40% faster. I hope I am wrong, because if I am right, it's a con game. Starting off the ChatGPT Plus web users with the Pro model, then later swapping it for the Standard model -- would meet the claims of model behavior consistency, while still qualifying as shenanigans.

5 comments

tedsanders 131 days ago

It's good to be skeptical, but I'm happy to share that we don't pull shenanigans like this. We actually take quite a bit of care to report evals fairly, keep API model behavior constant, and track down reports of degraded performance in case we've accidentally introduced bugs. If we were degrading model behavior, it would be pretty easy to catch us with evals against our API.

In this particular case, I'm happy to report that the speedup is time per token, so it's not a gimmick from outputting fewer tokens at lower reasoning effort. Model weights and quality remain the same.
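To make the distinction above concrete: total generation time is roughly output tokens times seconds per token, so a latency drop can come from either factor. Below is a minimal sketch that splits an observed speedup into those two parts. The `Run` type, the function names, and all numbers are hypothetical illustrations, not OpenAI measurements or any official API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured generation (hypothetical data, for illustration only)."""
    output_tokens: int  # tokens generated, including any reasoning tokens
    seconds: float      # wall-clock generation time


def seconds_per_token(run: Run) -> float:
    return run.seconds / run.output_tokens


def explain_speedup(before: Run, after: Run) -> dict:
    """Split a latency change into a per-token part and a token-count part.

    latency_ratio == per_token_ratio * token_count_ratio (up to rounding).
    """
    return {
        "latency_ratio": after.seconds / before.seconds,
        "per_token_ratio": seconds_per_token(after) / seconds_per_token(before),
        "token_count_ratio": after.output_tokens / before.output_tokens,
    }


# Case A: same number of tokens, each token arrives faster.
faster_tokens = explain_speedup(Run(1000, 50.0), Run(1000, 30.0))

# Case B: same speed per token, but fewer tokens (e.g. less reasoning).
fewer_tokens = explain_speedup(Run(1000, 50.0), Run(600, 30.0))
```

Both cases show the same drop in wall-clock time, but only in case A does `token_count_ratio` stay at 1. Anyone who logs token counts alongside latency can check which case applies.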

deaux 131 days ago

It looks like you do pull shenanigans like these [0]. The person you're replying to even mentioned "ChatGPT 5.2", but you're specifically talking only about the API, while making it sound like it applies across the board. Also appreciate the attempt to further hide this degradation of the product they paid for from users by blocking the prompt used to figure this out.

Happy to retract if you can state [0] is false.

[0] https://x.com/btibor91/status/2018754586123890717

tedsanders 131 days ago

Yes, independent of the API speedup, we also recently reduced the thinking effort in ChatGPT. Our intent here was purely user experience, not cost savings. People have complained about the slow speeds of the Thinking models for a long time (myself included), so we recently retuned it to be faster, at the expense of less thoroughness.

I won't BS you that costs are never part of our decision making. If costs didn't matter, we'd have unlimited rate limits and 10M token context windows and subscription pricing of $0. But as someone in the room where these decisions are made, I can honestly report that our goal is almost always trying to figure out how to make people happier, not trick them. We're trying to fairly earn subscriptions, not scam anyone. In the cases where we have accidentally misled people (e.g., saying voice mode was weeks away), it was optimistic planning, not nefarious intent.

API model behavior is guaranteed to stay nearly the same (modulo standard non-determinism, bugs, etc.). ChatGPT is harder to promise, not because we pull more shenanigans there, but just because we might tweak system prompts, add/remove tools, run A/B tests, etc. that vary performance a bit. But we definitely don't do things like quantize during busy parts of the day or nerf models after publishing evals - that would feel pretty shady.

offnominal 130 days ago

Did they reduce thinking effort on Codex too? It seems to have become significantly worse in the past couple of days. It keeps making dumb mistakes (that it wouldn't earlier), so my chats are much longer to get it to fix them. That might be more expensive for OpenAI (and me!).

empath75 131 days ago

ChatGPT 5.2 has gotten noticeably worse for me in the past couple of weeks, to the point that I stopped using it and just ask Claude Code questions instead.

quinncom 130 days ago

I’m so disappointed by this. It’s immediately noticeable that the results for the types of queries I make are worse. Queries using 5.2 Thinking now return very quickly, but with noticeably worse results.

tedsanders 130 days ago

It's unfortunately hard to make everyone happy. For now we're going to keep the default where it is, but we'll bump extended back up so that people can still get longer reasoning when they want it.

pickleRick243 128 days ago

This makes no sense. Why lower extended thinking time? Those who want faster answers can just use standard. The only purpose this serves is to "trick" the user into thinking he's still receiving "extended thinking" level answers at faster speed.

virgildotcodes 131 days ago

Would love a direct response to this.

jiggawatts 131 days ago

I've seen Sam Altman make similar claims in interviews, and I now interpret every statement from an OpenAI employee (and especially Sam) as if an Aes Sedai had said it.

I.e.: "keep API model behavior constant" says nothing about the consumer ChatGPT web app, mobile apps, third-party integrations, etc.

Similarly, it might mean very specifically that a "certain model timestamp" remains constant but the generic "-latest" or whatever model name auto-updates "for your convenience" to the new faster performance achieved through quantisation or reduced thinking time.

You might be telling the full, unvarnished truth, but after many similar claims from OpenAI that turned out to be only technically true, I remain sceptical.

tedsanders 131 days ago

That's a fair suspicion - I'll freely acknowledge that I am biased towards saying things that are simple and known, and I steer away from topics that feel too proprietary, messy, etc.

ChatGPT model behavior can definitely change over time. We share release notes here (https://help.openai.com/en/articles/6825453-chatgpt-release-...), and we also make changes or run A/B tests that aren't reported there. Plus, ChatGPT has memory, so as you use it, its behavior can technically change even with no changes on our end.

That said, I do my best to be honest and communicate the way that I would want someone to communicate with me.

OutOfHere 131 days ago

Starting off the ChatGPT Plus web users with the Pro model, then later swapping it for the Standard model -- would meet the claims of model behavior consistency, while still qualifying as shenanigans.

zamadatix 131 days ago

Hey Ted, can you confirm whether this 40% improvement is specific to API customers or if that's just a wording thing because this is the OpenAI Developers account posting?

tedsanders 131 days ago

It's specific to the API.

wahnfrieden 131 days ago

You're confirming you don't alter "juice" levels..?

tedsanders 131 days ago

No, we did adjust the thinking levels in ChatGPT recently, but it was motivated by trying to improve the product based on what users told us, not cost savings. I wrote a bit more here: https://news.ycombinator.com/item?id=46887150

8note 131 days ago

So what actually happens if it isn't shenanigans?

It's worth you guys doing some analysis on your end of why customers are getting worse results a week or two later, and putting out some guidelines about what context is poisonous and the like.

benterix 130 days ago

> I hope I am wrong, because if I am right, it's a con game.

I don't think they perceive it as a con game, on the contrary. They say below: "we also recently reduced the thinking effort in ChatGPT. Our intent here was purely user experience, not cost savings."

They are not the only ones playing this game. Google did the same with Gemini Pro.

scrollop 131 days ago

OpenAI isn't the only one:

Anthropic:

https://marginlab.ai/trackers/claude-code/

jxmesth 131 days ago

Someone should create a daily benchmark site for Codex like they did for Claude

OutOfHere 131 days ago

I see https://marginlab.ai/trackers/codex/

bethekidyouwant 131 days ago

I mean you can just run the benchmark again
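Re-running a benchmark only settles anything if the difference between two runs is larger than run-to-run noise. As a rough sketch of one way to check that, here is a two-proportion z-test on pass counts from two runs of the same benchmark. The function name and the counts are hypothetical, and this is a generic statistical check, not how any of the trackers linked in this thread actually work.

```python
import math


def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Compare pass rates of two benchmark runs.

    Returns (z, two_sided_p_value). A small p-value suggests the pass rates
    really differ, rather than differing by sampling noise alone.
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of change.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: launch-week run vs. a run two weeks later.
z_big, p_big = two_proportion_z_test(80, 100, 60, 100)      # 80% vs 60%
z_small, p_small = two_proportion_z_test(80, 100, 78, 100)  # 80% vs 78%
```

With these made-up numbers, the 20-point drop is clearly outside the noise while the 2-point drop is not. The normal approximation is only reasonable with enough tasks per run, and this does not address the harder question raised in the thread of how to run such a benchmark against the web ChatGPT app at all.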

OutOfHere 131 days ago

How are you going to benchmark the web ChatGPT Plus, which is where a reduction was suspected?