| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mrothroc 48 days ago
	The biggest strength of "would this get merged" is that it is actually a compound property: does it work, does it match convention, is it maintainable. And of course: does the reviewer actually want to take ownership of it. A single score has to average across these disjoint axes, which is probably why you needed so many rubrics per problem. What is the shape of the failures? When a model loses points, do they cluster on correctness? On convention? I run an autonomous pipeline and I can handle model shortcomings, but this kind of detail tells me what I need to shore up.