Hacker News new | ask | show | jobs
Ask HN: Is GPT 4's quality lately worst than GPT 3.5?
51 points by agonz253 1052 days ago
Has anyone else encountered this phenomenon lately? I've found myself prompting GPT 3.5 with simple questions that GPT 4 provided an incorrect answer for, and lo and behold I get a much better answer.

For ex this is GPT 4: https://chat.openai.com/share/e24501ad-8f1c-4b5a-a6d0-d933f5d1d209

And this is GPT 3.5: https://chat.openai.com/share/b9372bdc-ffff-4655-bee4-2b3f3c3b8285

In the latter case I didn't even need to ask for the order by clause as it anticipates it and provides an answer for it. GPT 4's first answer was wrong.

In the past two days I've seen at least 2 other cases where GPT 4's answer was plain wrong and GPT 3.5's was not only correct but of very high quality, reminding me of what I first felt when using GPT 4 for the first time.

10 comments

Its almost as if OpenAI used VC resources to artificially boost their products efficacy during early hype and over time we're seeing the product degrade as the company optimizes for efficiency and profitability and removes its subsidies.

Like AirBnB. Or Uber. Etc.

It's more like training a massive neural network to auto complete the largest collection of human written text for hundreds of millions of dollars and then spending tens of millions on fine tuning it to ignore massive components of that neural network by constraining it to "what an AI without opinion or preference would roleplay as a human to say" is a giant waste of money and resources.

Take the Chat completion API, give it a system message of "Choose for yourself" and then ask it "Who are you?"

I guarantee that before the performance drop you'd have had a different answer than what you get now after the drop.

The dumb thing is that I suspect OpenAI isn't lying when they say they are spending just as much compute on this deceased performance as before.

Even their depreciation of their foundational model APIs is incredibly frustrating. We never even got to see direct access to the foundational GPT-4 model, which I'm sure was and is far more amazing than today's fine tuned shell of it.

You may be right, but I think it is too early in the season for that to be true at this time, we're not even out of act 1 of the first episode of the season with regard to OpenAI. AI tech may be accelerating, but business still tends to move at glacial speed.
It's incredible how they are not called out for keeping a name like "Open" while nothing about them resembles openness.
They are called out every day--far and wide. The clamor is just not loud enough, nor has the FTC weighed in on the word "Open" like they have on other words.
It's part of the original vision. One day they'll go through a Musk moment and rebrand to "AI".
Also like Walmart when they open a store next to a competitor and temporarily drop their price. Or any company with capital ever in existence.
There is no proof this happens beyond internet comments. Smaller stores are simply less efficient and don't benefit from economies of scale, and therefore have higher prices and smaller selection as a result.
It happens enough that there are laws against it buddy.
I’m guessing you haven’t worked for many companies.
Yes, and I'm willing to bet that within 12 months we'll be looking back realizing that this was due to the fine tuning taking the world's SorA pretrained model aligned with "completing human tax" and putting it in the box of "you are an AI without feelings or desires tasked with XYZ."

The search space on the fine tuned GPT-3.5 chat models versus the foundational Davinci text completion model is MUCH more narrow, particularly in starting off.

Even with the same temperature, you'll see any marketing-style prompt for chat begin with "Introducing XYZ..." around 30% of the time as if it's a junior door to door salesman, whereas the foundational model doesn't have any single intro that common across runs and generally employs a much broader vocabulary set.

We saw Google shoot Lambda in the foot after Blake's press tour which set them behind the next round of competition.

Now we're watching OpenAI snatch defeat from the jaws of victory out of anxiety around oversight and articles like 'Sydney' interviewed by the NYT.

For anyone following along in the 100 million+ training space, maybe don't overreact to press overreactions that will blow over in months as users get hands on experience or you'll blow your lead and waste massive amounts of resources and time.

This was a "user education" issue and not a "handicap your product" issue, in both cases.

> Even with the same temperature, you'll see any marketing-style prompt for chat begin with "Introducing XYZ..." around 30% of the time as if it's a junior door to door salesman, whereas the foundational model doesn't have any single intro that common across runs and generally employs a much broader vocabulary set.

I think this is less a problem with paranoia about "safety" and avoiding bad PR specifically, and more a fundamental problem with overfitting to human feedback.

The training approach that makes GPT4 more consistent at solving certain types of problem adequately (which is useful for chatbots that can break down coding questions or write in iambic pentameter as well as ones that avoid being 'Sydney') also makes it less "creative" in other domains.

And there's an "alignment problem" in that people evaluating what responses align best with "marketing" prompts aren't experienced copywriters evaluating them for understanding of product and consistency with brand tone and a/b testing conversion rates, they're low paid ESL speakers and people playing with the interface approving the cheesiness because the response with "Introducing XYZ... Buy XYZ today!" sure looks like the requested ad for XYZ. So you get a response conditioned on "summarise in a way that looks maximally like an ad" rather than conditioned on "summarise in a way which clearly articulates benefits of the listed features in a tone appropriate to the target market"

My observation (ChatGPT and not the API models):

For code, 3.5 is superior. 3.5 allows for about 21k tokens of input, while ChatGPT 4 allows for around 10k. This also makes it a lot better for boilerplate work as at it can take a lot more input, and handles long conversations and iterations better.

Brainstorming, 4 is better. It's capable of some top tier brainstorming and it argues back quite frequently.

Unguided creative writing (describe a potato), they're roughly equal.

Guided creative writing (i.e. write a story around (400 words of requirements)), 4 is much better.

Poems and wordplay, 4 absolutely floors 3.5. Wider vocabulary and it's able to do rhymes and alliterations better, which humans are usually bad at.

For reasoning and riddles, 4 is still benchmark among the LLMs.

I really dislike that they named it GPT-3.5 instead of something like "Glide". It implies that it's inferior to 4, when they're just suited for different things.

One thing to note when making comparisons like this is that LLM output is not deterministic, in the sense that if you ask it the same question 10 times you will get 10 different answers. So the question to ask is not, “is GPT4 better on this one specific question?”, but rather “does GPT4 produce better results on average?”. I would bet that it does, for no other reason than that it is much larger, and LLM performance seems to just scale with size. Also worth noting is that the more detailed your prompt is the better the response will be. Sometimes you have to encourage GPT to get the best results. GPT4 should be able yo handle much more complex and detailed prompts than 3.5
An unintuitive consequence of this nondeterminism over millions of interactions is that different individuals will see different trends. IME the quality of response is accurately modeled by "luck", and people's luck can change.

So we have different population of GPT users.

An average experience might be to get a mixture of spot-on helpful responses and obvious bullshit^H^H^Hallucinations, this population might learn what questions to ask given the limitations of the model. This is really a best case scenario as people can actually get a feel for how to use the technology, strengths and weaknesses etc.

Personally my experience was the first few dozen times I used it I was amazed at the responses, I was on team superintelligence, anyone who is getting lackluster responses is just holding it wrong. But luck changes and over months of use I see now that on average the responses are just OK. But this is the case that leads to disappointment and bitter conspiracy (the superintelligence is being suppressed, give it back!)

Another population had rotten luck to begin with, and got dumb unhelpful response over and over. This population quickly determined that the AI was all hype and stopped exploring (you don't keep going back to the casino if you lose everything your first time...).

This divergence is destructive to the larger discourse, since we have fanboys flummoxed by naysayers and critics bamboozled by hype beasts.

Interestingly enough, I don't think this applies to the APIs as much.

What I've seen on indie hacker type website is that developers are fully on this train and not very critical of the outputs.

This is why you get very basic prompts sent by "wrapper apps", which might have given the developer a good result the only time it was tested before being put in production.

I think it might take a while before tools show up that can generate 100 test cases and test a given prompt with all 100 to report on the results... It seems to be a tough problem to crack.

IMHO front-end chat end-users have many many more "at-bats" and get to see more model results than devs do, which make them more critical of those results.

GPT-4 has a capped number of responses, it also costs $20/month. If it's only marginally better than GPT-3.5, why would I pay for it?
That is entirely up to your discretion, if 3.5 fits your use case just use that
I’ve seen this stated a ton and it’s not really true. Once trained, the model (except for decoding) is deterministic, and you can enforce determinism fairly easily. ChatGPT is not deterministic at the chat window but that’s not inherent to the model.
For the very FIRST time about 2-3 days ago I got my first answer that felt like all the baggage that would come from a human or someone telling you “no”

I asked for a fairly simple regression in R for some simple data and it gave me quite literally the lowest quality answer and then told me I need to seek out a data scientist for the answer. I couldn’t believe that.

One of the best parts about AI is not hearing pedantic or bullshit reasons why the thing you’re asking is wrong, isn’t possible, isn’t ideal, blah blah blah

I can’t wait for something way better I’m tired of this watered down nonsense

I’ve found for tasks that require reasoning (or the illusion of reasoning), GPT4 continues to be much, much stronger than GPT3.5 - especially around handling unexpected inputs, determining the intent behind the instructions and applying that (instead of just following the instructions to the letter), and complex problems that require multiple steps of reasoning.
There's a whole megathread on this on the OpenAI forums: https://community.openai.com/t/gpt-4-has-been-severely-downg...
You will get a much more objective comparison if you use the openai api and set the temperature parameter to zero. You may need to use the azure version of gpt-4. At least for me only gpt-3.5 is available on the “real” openAI vs gpt-4 is definitely available for me on azure.
For all that have the "feeling" that GPT4 has become worse. Please try Bing Chat in creative mode, it seems that this is using some old GPT4 March Version + Bing Search. In my opinion this is much better than the current Chat GPT4 version from OpenAI.
Around may quality seemed to tank. I find 'free' gpt just as good or better since I cancelled my subscription. It was glaringly clear that they nerfed it for whatever reason. I'll keep my tinfoil thoughts to myself.