by shabie
667 days ago
If I understood you correctly, yes: I believe the notes should focus on things LLMs don't understand well rather than on things we know they typically get right. For us that means internal concepts and how people talk about them, and less so programming syntax. Also, I've been thinking about adding structure to the grading notes so that the variance in quality you get when asking people to leave notes becomes smaller. But structure increases the burden, with prompts like "what should the answer have", "what is definitely wrong to mention", etc.
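To make the structure idea concrete, here is a minimal sketch of what a structured grading note could look like. The class name, field names, and prompt layout are all assumptions for illustration, not an existing format.

```python
from dataclasses import dataclass, field


@dataclass
class GradingNote:
    """Hypothetical structured grading note; every field name is an assumption."""

    question: str
    # "what should the answer have"
    must_include: list[str] = field(default_factory=list)
    # "what is definitely wrong to mention"
    must_not_mention: list[str] = field(default_factory=list)
    # optional unstructured remarks, to keep the burden on note writers low
    free_text: str = ""

    def to_prompt(self) -> str:
        """Render the note as plain text that could be handed to a grader."""
        lines = [f"Question: {self.question}"]
        if self.must_include:
            lines.append("The answer should include:")
            lines += [f"- {item}" for item in self.must_include]
        if self.must_not_mention:
            lines.append("It is definitely wrong to mention:")
            lines += [f"- {item}" for item in self.must_not_mention]
        if self.free_text:
            lines.append(f"Other notes: {self.free_text}")
        return "\n".join(lines)
```

Every field except `question` is optional here, so someone in a hurry can still leave only free text; that is one possible way to trade structure against burden.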
|
As for the structure approach, which interests me: how do you find and note down the information you mention in your second paragraph?
I agree it is valuable. In times past, just identifying wrong answers (with multiple-choice questions and distractors) could give you great insight into misconceptions in a class, and that could fill the "things definitely wrong to mention" part.
But how do you, without a human expert, work out "what should the answer have" in a way that doesn't knock out answers that are left-field but genuine or synthesizing?
My experience of marking exam papers in STEM (only a few courses, mind, maybe 300 papers) is that markers are wildly different. Some routinely give every pupil grades about 5% higher than a more critical neighbour would; some are "absolutist" about the course content (either as they learned it or as it is currently taught to the student); and more than would ever admit it will freely move scores up 10-20% if the student's answer simply interests them with its novelty, so long as it's not outright provably wrong.
The available tools for re-marking and for 1:1 marking feedback (as offered in Germany after exam season) allow rectifying and smoothing these differences, but in some ways the differences are genuine. I just don't know how you can codify that. Even among human markers there is so much disagreement. I doubt that having 3 different models mark with the same criteria and then averaging the results or synthesizing a "majority judgement" summary would work either. It's too hard to make the models care about, or identify, novelty relative to the other answers you've seen that day for the question, if that makes sense.
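For what it's worth, the averaging idea is easy to sketch, and the sketch shows its limit. The function name, the 0-100 scale, and the 10-point threshold below are assumptions for illustration; the marks could come from three models or three human markers.

```python
from statistics import median


def aggregate_marks(marks: list[float], max_spread: float = 10.0) -> dict:
    """Combine marks for one answer (assumed 0-100 scale).

    Returns the median mark, the spread between the highest and lowest
    marker, and whether the spread exceeds max_spread. The threshold is an
    arbitrary assumption, not a recommended value.
    """
    if not marks:
        raise ValueError("need at least one mark")
    spread = max(marks) - min(marks)
    return {
        "median": median(marks),
        "spread": spread,
        "needs_human_review": spread > max_spread,
    }


# One marker rewards a novel answer; the other two stick to the rubric.
result = aggregate_marks([62, 65, 80])
```

The median of 65 simply discards the marker who valued the novelty, which is exactly the kind of genuine difference that gets smoothed away. About the most an aggregate like this can do is flag the disagreement so that a person looks at the answer again.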