by aeternum 1100 days ago
In the paper, they at least claimed to manually verify the correct answers.
2 comments

I just looked again and didn't see that claim. Can you verify? https://arxiv.org/pdf/2306.08997.pdf

If, as per the linked critique, some of the questions in the test set were basically nonsense, then clearly they couldn't have manually verified all the answers, or they would have noticed that.

>We then process the data by manually correcting each question and answer to ensure quality and correctness

Section 2.1

The GitHub repo also has wording around this:

> We double-verify manually that the grading of the test set is correct. https://github.com/idrori/MITQ/blob/main/index.html#L552

I agree it looks like this may not have actually been done given some of the questions and answers in the dataset.

Then - having not read the paper - what is the point of the automated grading?
To avoid spending time manually grading obviously incorrect answers (i.e. only grading 1/18 of them).
Got it!