by mrothroc
3 days ago
The biggest strength of "would this get merged" is that it is actually a compound property: does it work, does it match convention, is it maintainable? And of course: does the reviewer actually want to take ownership of it? A single score has to average across these independent axes, which is probably why you needed so many rubrics per problem. What is the shape of the failures? When a model loses points, do the losses cluster on correctness? On convention? I run an autonomous pipeline and can handle model shortcomings, but this kind of detail tells me what I need to shore up.