| HN Mirror



	by mquander 1103 days ago
	> In a way it makes perfect sense that gpt4 can score 100% on a test gpt4 also grades.

	Even this is overstating it, because for each question, GPT-4 is considered to get it "correct" if, across the (18?) trials with various prompts, it ever produces a single answer that GPT-4 then, for whatever reason, accepts. That's not getting "100%" on a test.
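To put rough numbers on why "accepted at least once in k tries" is not the same as a score, here is a minimal sketch. It assumes the trials are independent and that each one has the same chance `p` of being accepted by the grader; both are simplifying assumptions of mine, not claims from the paper, and `any_of_k_accept_rate` is a made-up helper name. The trial count of 18 is the uncertain figure from the comment above.

```python
def any_of_k_accept_rate(p: float, k: int) -> float:
    """Chance that at least one of k independent trials is accepted.

    p: assumed per-trial probability that the grader accepts an answer.
    k: number of trials per question.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Complement of "every one of the k trials is rejected".
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Hypothetical per-trial acceptance rates, for illustration only.
    for p in (0.05, 0.2, 0.5):
        rate = any_of_k_accept_rate(p, 18)
        print(f"per-trial accept rate {p:.2f} -> any-of-18 rate {rate:.3f}")
```

Under these assumptions, even a modest per-trial acceptance rate pushes the any-of-k rate close to 1, which is why the headline number says little about single-attempt accuracy.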


aeternum 1102 days ago

In the paper, they at least claimed to manually verify the correct answers.


mquander 1102 days ago

I just looked again and didn't see that claim. Can you verify? https://arxiv.org/pdf/2306.08997.pdf

If, as per the linked critique, some of the questions in the test set were basically nonsense, then clearly they couldn't have manually verified all the answers, or they would have noticed that.


aeternum 1102 days ago

> We then process the data by manually correcting each question and answer to ensure quality and correctness

Section 2.1

The GitHub repo also has wording to this effect:

> We double-verify manually that the grading of the test set is correct. https://github.com/idrori/MITQ/blob/main/index.html#L552

I agree it looks like this may not have actually been done, given some of the questions and answers in the dataset.


sanderjd 1102 days ago

Then - having not read the paper - what is the point of the automated grading?


riffraff 1102 days ago

To avoid spending time manually grading the obviously incorrect ones (i.e., manually grading only 1/18 of them).


sanderjd 1102 days ago

Got it!
