by parsimo2010
24 days ago
This is a great example of why prompt engineering is still relevant. Without definitions, examples, and a well-defined rubric, you're going to see different models disagree by a level in either direction. When you get more prescriptive, the models tend to agree better. I've experimented with AI grading for undergraduate math courses and see basically the same thing. If you just tell the AI "grade this problem and assign a letter grade," I've only seen about 30% agreement between the human-assigned grade and the AI-assigned grade. That rises to over 75% if you count a "match" as being within one letter grade. To get better agreement than that, you have to spend a lot more time on the rubric: what kinds of mistakes are a big deal, what kinds are not, how much work has to be shown to get credit, and a couple of examples of each letter grade. Once you have done that, the AI agrees with human graders a lot more often, but it is hard to know when you've given enough guidance for a problem.
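To make the two agreement numbers concrete, here is a minimal sketch of how exact agreement and "within one letter grade" agreement could be computed. The grade scale, the `agreement` function, and the sample grades are all made up for illustration; they are not the data or tooling from the courses described above.

```python
# Hypothetical five-step letter scale, ordered from lowest to highest.
GRADES = ["F", "D", "C", "B", "A"]


def agreement(human, ai, tolerance=0):
    """Fraction of problems where the AI grade is within `tolerance`
    letter steps of the human grade (tolerance=0 means exact match)."""
    if len(human) != len(ai):
        raise ValueError("grade lists must be the same length")
    if not human:
        raise ValueError("need at least one graded problem")
    rank = {grade: i for i, grade in enumerate(GRADES)}
    matches = sum(
        abs(rank[h] - rank[a]) <= tolerance for h, a in zip(human, ai)
    )
    return matches / len(human)


# Illustrative grades only, not real results.
human = ["A", "B", "C", "D", "F", "B", "C", "A"]
ai = ["B", "B", "D", "D", "C", "A", "C", "A"]

exact = agreement(human, ai)                    # exact letter match
within_one = agreement(human, ai, tolerance=1)  # off by at most one letter

print(f"exact: {exact:.0%}, within one letter: {within_one:.0%}")
```

The gap between the two numbers is the useful part: if most disagreements are off by a single letter, the rubric probably needs sharper boundaries between adjacent grades rather than a complete rewrite.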