That makes a lot of sense! I do see value there. I guess it depends on the context of the class, but it's biased toward the content that LLMs are bad at understanding or marking at test time (e.g. the items/subjects in MMLU that LLMs still fail at).

As for the structural approach, which interests me: how do you find and note down the information you reference in your second paragraph?

I agree it is valuable. In the past, just identifying wrong answers (with multiple-choice questions and their distractors) could give you great insight into misconceptions in a class (e.g. that could fill in the "Things definitely wrong to mention").

But how do you, without a human expert, work out "what should the answer have?" in a way that doesn't knock out answers that are left-field but genuine or synthesizing?

My experience of marking exam papers in STEM (only a few courses, mind, maybe 300 papers) is that markers are wildly different. Some consistently give every pupil grades about 5% higher than a more critical neighbour would; some are "absolutist" about the course content (either as they learned it, or as it is currently taught to the students); and more than would ever admit it will freely move scores up 10-20% if a student's answer simply interests them by being novel, so long as it's not outright provably wrong.
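To make the "+5% marker" point concrete, here is a rough sketch of how a constant leniency offset could be estimated, assuming you have some scripts that were marked by more than one person. The function name `marker_offsets`, the tuple layout, and the sample numbers are all made up for illustration. It only captures a steady bias; it says nothing about the one-off bumps for answers a marker finds interesting.

```python
from collections import defaultdict
from statistics import mean

def marker_offsets(marks):
    """marks: list of (marker, script_id, score) tuples (hypothetical layout).

    Returns each marker's average deviation from the per-script mean,
    i.e. how lenient (+) or strict (-) they are relative to colleagues.
    """
    # Collect all scores given to each script.
    by_script = defaultdict(list)
    for marker, script, score in marks:
        by_script[script].append(score)

    # Record how far each marker sits from the script's mean score.
    deviations = defaultdict(list)
    for marker, script, score in marks:
        deviations[marker].append(score - mean(by_script[script]))

    return {marker: mean(devs) for marker, devs in deviations.items()}

# Illustrative data only: marker A is consistently 10 points above marker B.
marks = [
    ("A", "s1", 70), ("B", "s1", 60),
    ("A", "s2", 55), ("B", "s2", 45),
]
offsets = marker_offsets(marks)
print(offsets)
```

Subtracting these offsets would smooth out the "generous neighbour" effect, but whether you should is exactly the open question, since some of the difference is genuine judgement.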

The available tools for re-marking and for getting 1:1 marking feedback (as offered in Germany after exam season) allow these differences to be rectified and smoothed out, but in some ways the differences are genuine. I just don't know how you can codify that. Even among human markers, there is so much disagreement. I doubt that having 3 different models mark against the same criteria, then averaging the results or synthesizing a "majority judgement" summary, would work either. It's too hard to make the models care about or identify novelty relative to the other answers you've seen that day for the question, if that makes sense.
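For reference, the kind of three-model aggregation I'm doubtful about would look something like this minimal sketch. The function name `aggregate`, the 0-100 scale, and the `max_spread` threshold of 10 points are assumptions for illustration, not anything from a real marking pipeline.

```python
from statistics import median

def aggregate(scores, max_spread=10):
    """Combine scores from several markers (human or model) for one answer.

    scores: list of numeric marks on the same scale (assumed 0-100 here).
    max_spread: hypothetical threshold above which the disagreement is
    large enough that a human should look at the answer.
    """
    spread = max(scores) - min(scores)
    return {
        "score": median(scores),          # the middle mark, robust to one outlier
        "spread": spread,                 # how much the markers disagree
        "needs_human": spread > max_spread,
    }

close = aggregate([60, 62, 64])
split = aggregate([62, 65, 80])
print(close)
print(split)
```

The most this does is flag that the markers disagree. It cannot tell whether the high outlier rewarded a genuinely novel answer or just got it wrong, and it has no notion of how an answer compares with the others seen for the same question.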