by mquander 1103 days ago
> In a way it makes perfect sense that gpt4 can score 100% on a test gpt4 also grades.

Even this is overstating it, because for each question, GPT-4 is considered to get it "correct" if, across the (18?) trials with various prompts, it ever produces one single answer that GPT-4 then, for whatever reason, accepts. That's not getting "100%" on a test.
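To put rough numbers on why "any accepted answer across the trials" is much weaker than a real score, here is a minimal sketch. It assumes the trials are independent and share one per-trial acceptance probability, which is a simplification and not something the paper states; the function names are hypothetical, and 18 is only the uncertain trial count mentioned above.

```python
def any_of_k_pass_rate(p_accept: float, k: int) -> float:
    """Chance that at least one of k attempts is accepted by the grader.

    Assumes independent attempts with the same acceptance probability
    (an illustrative assumption, not taken from the paper).
    """
    return 1.0 - (1.0 - p_accept) ** k


def question_counts_as_correct(accepted_flags: list[bool]) -> bool:
    """The scoring rule as described in the comment: one accepted attempt is enough."""
    return any(accepted_flags)


if __name__ == "__main__":
    k = 18  # hypothetical trial count, per the "(18?)" above
    for p in (0.05, 0.2, 0.5):
        rate = any_of_k_pass_rate(p, k)
        print(f"per-trial acceptance {p:.2f} -> any-of-{k} pass rate {rate:.3f}")
```

Under these assumptions, even a grader that accepts an attempt only one time in five would mark roughly 98% of questions "correct", regardless of whether the accepted answer is actually right.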


In the paper, they at least claimed to manually verify the correct answers.
I just looked again and I didn't see that claim. Can you point to where it is? https://arxiv.org/pdf/2306.08997.pdf

If, as per the linked critique, some of the questions in the test set were basically nonsense, then clearly they couldn't have manually verified all the answers, or they would have noticed that.

> We then process the data by manually correcting each question and answer to ensure quality and correctness

Section 2.1

The GitHub repo also has wording around this:

> We double-verify manually that the grading of the test set is correct. https://github.com/idrori/MITQ/blob/main/index.html#L552

I agree it looks like this may not actually have been done, given some of the questions and answers in the dataset.

Then, having not read the paper: what is the point of the automated grading?
To not spend time manually grading obviously incorrect ones (i.e. only grading 1/18 of them).
Got it!