Hacker News new | ask | show | jobs
by mrothroc 3 days ago
The biggest strength of "would this get merged" is that it is actually a compound property: does it work, does it match convention, is it maintainable. And of course: does the reviewer actually want to take ownership of it. A single score has to average across these disjoint axes, which is probably why you needed so many rubrics per problem.

What is the shape of the failures? When a model loses points, do they cluster on correctness? On convention? I run an autonomous pipeline and I can handle model shortcomings, but this kind of detail tells me what I need to shore up.