| HN Mirror



	by kostaj 21 days ago
	This paper covers only the disagreement between models. That disagreement establishes a floor on the error, but not which model is better. I'm planning a follow-up study that benchmarks against human-labelled verdicts, still using a corpus that the models have not seen during training.

1 comment

aspenmartin 21 days ago

You also need better measures of agreement that are standard in the literature, like Krippendorff's alpha with an ordinal metric. So many footguns in this methodology.
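For anyone unfamiliar with the measure, here is a minimal from-scratch sketch of Krippendorff's alpha with the ordinal distance metric. It is an illustration only, not the paper's method: the function name `krippendorff_alpha_ordinal` and the input layout (one list of ratings per item, missing ratings simply left out) are assumptions made for this example, and the ratings are assumed to sort in their natural ordinal order.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_ordinal(units):
    """Krippendorff's alpha with the ordinal metric (illustrative sketch).

    units: list of lists; each inner list holds the ratings one item received.
    """
    # Coincidence matrix: every ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings on that item.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    categories = sorted({c for pair in coincidences for c in pair})
    totals = {c: sum(coincidences[(c, k)] for k in categories) for c in categories}
    n = sum(totals.values())

    def delta_sq(c, k):
        # Ordinal distance: mass of all categories from c to k inclusive,
        # minus half the mass of the two endpoints, squared.
        lo, hi = sorted((categories.index(c), categories.index(k)))
        between = sum(totals[g] for g in categories[lo:hi + 1])
        return (between - (totals[c] + totals[k]) / 2.0) ** 2

    observed = sum(o * delta_sq(c, k) for (c, k), o in coincidences.items())
    expected = sum(
        totals[c] * totals[k] * delta_sq(c, k)
        for c in categories
        for k in categories
    )
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    # alpha = 1 - D_o / D_e, with D_o = observed / n, D_e = expected / (n (n - 1))
    return 1.0 - (n - 1) * observed / expected


# Two raters, four items; the last item is the only disagreement.
near = krippendorff_alpha_ordinal([[1, 1], [2, 2], [3, 3], [1, 2]])
far = krippendorff_alpha_ordinal([[1, 1], [2, 2], [3, 3], [1, 3]])
print(near, far)
```

The point of the ordinal metric shows up in the two calls at the end: a disagreement between adjacent grades (1 vs 2) is penalized less than one between the extremes (1 vs 3), so `near` comes out higher than `far`. Raw percent agreement would treat the two cases identically.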
