by underanalyzer 1100 days ago
Great analysis, props to these students for taking the time to challenge such a sensational headline. In the conclusion they mention my biggest problem with the paper, which is that it appears gpt4 grades the answers as well (see section 2.6 "Automatic Grading").

In a way it makes perfect sense that gpt4 can score 100% on a test gpt4 also grades. To be clear, the grading gpt4 has the answers, so it does have more information, but it still might overlook important subtleties in how the real answer differs from the generated answer due to its own failure to understand the material.


> In a way it makes perfect sense that gpt4 can score 100% on a test gpt4 also grades.

Even this is overstating it, because for each question, GPT-4 is considered to get it "correct" if, across the (18?) trials with various prompts, it ever produces one single answer that GPT-4 then, for whatever reason, accepts. That's not getting "100%" on a test.
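To put rough numbers on that, here is a minimal sketch of how "accepted at least once in k tries" inflates the score. It assumes the trials are independent and that each one has the same chance of being accepted by the grader, which is a simplification and not something the paper states. The function name and the per-trial rates are made up for illustration, and k = 18 is just the guess from above.

```python
def any_of_k_pass_rate(p_accept: float, k: int) -> float:
    """Chance that at least one of k independent attempts is accepted.

    p_accept is a hypothetical per-attempt probability that the grader
    accepts the answer (whether or not the answer is actually right).
    """
    if not 0.0 <= p_accept <= 1.0:
        raise ValueError("p_accept must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Complement of "every one of the k attempts is rejected".
    return 1.0 - (1.0 - p_accept) ** k


if __name__ == "__main__":
    k = 18  # assumed number of trials per question
    for p in (0.05, 0.10, 0.30):  # illustrative per-trial acceptance rates
        rate = any_of_k_pass_rate(p, k)
        print(f"per-trial accept rate {p:.2f} -> any-of-{k} rate {rate:.3f}")
```

Under these assumptions, even a low per-trial acceptance rate turns into a high per-question "solve rate", and the same arithmetic applies to false accepts by the grader.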

In the paper, they at least claimed to manually verify the correct answers.
I just looked again and I didn't see that claim, can you verify? https://arxiv.org/pdf/2306.08997.pdf

If, as per the linked critique, some of the questions in the test set were basically nonsense, then clearly they couldn't have manually verified all the answers or they would have noticed that.

>We then process the data by manually correcting each question and answer to ensure quality and correctness

Section 2.1

Then the github repo also has wording around this:

> We double-verify manually that the grading of the test set is correct. https://github.com/idrori/MITQ/blob/main/index.html#L552

I agree it looks like this may not have actually been done given some of the questions and answers in the dataset.

Then - having not read the paper - what is the point of the automated grading?
To not spend time manually grading obviously incorrect ones (i.e. only grading 1/18 of them).
Got it!
If people haven't seen it, UT Prof Scott Aaronson had GPT4 take his Intro Quantum final exam and had his TA grade it. It made some mistakes, but did surprisingly well with a "B". He even had it argue for a better grade on a problem it did poorly on.

Of course this was back in April when you could still get the pure unadulterated GPT4 and they hadn't cut it down with baby laxative for the noobs.

https://scottaaronson.blog/?p=7209

See the comment from Ose "Comment #199 April 17th, 2023 at 6:53 am" at the bottom of that blog post...
It literally did not change. Not one bit. Please, if you're reading this, speak up when people say this. It's a fundamental misunderstanding: there's so much chatter around AI, not much info, and the SNR is getting worse.
I’ve seen the recent statement by someone at OpenAI but whatever weasel words they use, it did change.

The modified cabbage-goat-lion problem [1] that GPT4 always failed to solve, it now gets it right. I’ve seen enough people run it in enough variations [2] before to know that it absolutely did change.

Maybe they didn’t “change” as in train anything, but it’s definitely been RLHFed and it’s impacting the results.

[1] https://news.ycombinator.com/item?id=35155467

[2] anecdata: dozens of people, hundreds of times total

I attribute this to two things:

1. People have become more accustomed to the limits of GPT-4, similar to the Google effect. At first they were astounded, now they're starting to see its limits

2. Enabling Plugins (or even small tweaks to the ChatGPT context like adding today's date) pollutes the prompt, giving more directed/deterministic responses

The API, as far as I can tell, is exactly the same as it was when I first had access (which has been confirmed by OpenAI folks on Twitter [0])

[0] https://twitter.com/jeffintime/status/1663759913678700544

In my experience with Bing Chat, in addition to what you say, there is also some A/B testing going on as well.
"It literally did not change. Not one bit."

How do you know?

Even if the base model didn't change, that doesn't mean they didn't fine tune it in some way over time. They also might be passing its answers through some other AI or using some other techniques to filter, censor, and/or modify the answers in some way before returning them to the user.

I don't know how anyone could confidently say what they're doing unless they work at OpenAI.

Someone who works at OpenAI said so two weeks ago
Then again, can we trust that person? It's not like they didn't have a conflict of interest in making that claim.
Yes, it’s turtles all the way down
Nice try, ClosedAI. Then how do you explain this?

https://news.ycombinator.com/item?id=36348867

Well, I had hoped the sarcastic comparison to cut heroin would make it clear.

No, I don't think there's much change at all to GPT-4 (at the API level) and probably not that much at the pre/post language detection and sanitation for apparently psychotic responses.

You should take a look at this video. He is a researcher at Microsoft and had access to a private version of ChatGPT. He literally claims that ChatGPT 4 is not as good as before. His talk actually demonstrates the different evolutions.

https://youtu.be/qbIk7-JPB2c

If you are referring to that social media post by an OpenAI employee saying it hasn’t changed, they were specifically referring to the API. iirc, the same employee explicitly stated the Web UI version changes quite regularly. Someone correct me with the link if I’m wrong, I don’t have it handy.
This "GPT4 evaluating LLMs" problem is not limited to this case. I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day.

Couple this with the reliance on crowd-sourcing to create evaluation datasets and the heavy use of GPT3.5 and GPT4 by MTurk workers, and you have a big fat feed-forward process benefiting only one party: OpenAI.

The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out. Reddit, Twitter and the like are awakening just now - to find that they're basically powerless against this wave of distorted future standards.

Once GPT is sufficiently proven to pass every existing test on Earth, every institution will be so reliant on producing work with GPT that we won't have a "100% handmade exam" anymore. No problem will be left for GPT to tackle.

>> I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day.

Why? Because machine learning is not a scientific field. That means anyone can say and do whatever they like and there's no way to tell them that what they're doing is wrong. At this point, machine learning research is like the social sciences: a house of cards, unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research. People simply choose whatever approach they like, cite whatever result they like, because they like the result, not because there's any reason to trust it.

Let me not bitch again about the complete lack of anything like objective measures of success in language modelling, in particular. There have been no good metrics, no meaningful benchmarks, for many decades now, in NLP as a whole, but in language generation even more so. This is taught to students in NLP courses (our tutors discussed it in my MSc course), there is scholarship on it, there is a constant chorus of "we have no idea what we're doing", but nothing changes. It's too much hard work to try and find good metrics, build good benchmarks. It's much easier to put a paper on arxiv that shows SOTA results (0.01 more than the best system compared to!). And so the house of cards rises ever towards the sky.

Here's a recent paper that points out the sorry state of Natural Language Understanding (NLU) benchmarking:

What Will it Take to Fix Benchmarking in Natural Language Understanding?

https://aclanthology.org/2021.naacl-main.385/

There are many more, going back years. There are studies of how top-notch performance on NLU benchmarks is reduced to dust when the statistical regularities that models learn to overfit to in test datasets are removed. Nobody. fucking. cares. You can take your science and go home, we're making billion$$$ here!

I would have said machine learning is more like materials science, but you are on the right track.

As you increase the number of bits you are trying to comprehend, you move from quantum physics to chemistry to material science to biology to social science.

At certain points, the methods and reproducibility become somewhat of a dark art. I have experienced that in my field of materials science.

Because these models are using billions or trillions of random number generators in their probability chains, it starts looking more like the harder hard sciences: it gets very difficult to track and understand what is important.

I think machine learning will be easier to comprehend than social sciences, so I wouldn't put it that high. It will be something between materials science and biology levels of difficulty in understanding.

Yes, I have similar concerns. These models regurgitate previously seen strings, previous benchmarks included. When you try to evaluate their sheer ability to reason on the text, however, they perform poorly. (Our experiments with GPT-3 are here: https://doi.org/10.5220/0012007500003470)
> The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out.

If OpenAI ceased to be – probably for some legislative reason –, would the problems go away?

The damage will have been done so I don’t think so.
> but it still might overlook important subtleties

If there's one thing we can be certain of, it's that LLMs often overlook important subtleties.

Can't believe they used GPT4 to also evaluate the results. I mean, we wouldn't trust a student to grade their own exam even when given the right answers to grade with.

I noticed that when I read the paper. I know it's hard to scale but I'd want to see competent TAs doing the grading. I also found the distribution of courses a bit odd. Some of it might be just individual samples but intro courses I'd expect to be pretty cookie cutter (for GPT) were fairly far down the list and things I'd expect to be really challenging had relatively good results.
Can attest that the distribution is odd from the test set that we sampled.

We've already run the compute to run the zero-shot GPT model on all of the datapoints in the provided test set. We're going through the process now of grading them manually (our whole fraternity is chipping in!) and should have the results out relatively soon.

I can say that, so far, it's not looking good for that 90% correct zero-shot claim either.

Since you are here, when I was reading the paper I wondered -- when they show the "zero-shot solve rates", does that mean that they are basically running the same experiment code, but without the prompts that call `few_shot_response` (i.e. they are still trying each question with every expert prefix, and every critique?) It wasn't clear to me at a glance.