by IshKebab 2428 days ago
Regarding the type of errors, it seems like the benchmark should be able to take that into account. That is, get a load of humans to do the task on the same specific examples, then for each example you know how hard it is, and what acceptable answers are (I bet a lot of the ground truth is wrong or ambiguous).

Then you can benchmark your AI but penalise it more heavily for getting things wrong that are obvious to a human.
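A rough sketch of what that scoring could look like, in Python. Everything here is made up for illustration: the function name `human_weighted_error`, the `min_share` threshold for what counts as an "acceptable" answer, and the choice to use the share of humans giving the most common answer as the "obviousness" weight. Other weightings would be just as defensible.

```python
from collections import Counter


def human_weighted_error(human_answers, model_answers, min_share=0.2):
    """Error rate where mistakes on human-obvious examples cost more.

    human_answers: one list of human labels per example.
    model_answers: one model label per example.
    min_share: hypothetical threshold; any answer given by at least this
        share of humans is treated as acceptable.
    """
    total_penalty = 0.0
    total_weight = 0.0
    for answers, prediction in zip(human_answers, model_answers):
        counts = Counter(answers)
        n = len(answers)
        # Answers that enough humans gave are all accepted, which
        # tolerates ambiguous or wrong ground truth.
        acceptable = {a for a, c in counts.items() if c / n >= min_share}
        # Assumed "obviousness" weight: share of humans who gave the
        # most common answer (1.0 means everyone agreed).
        weight = counts.most_common(1)[0][1] / n
        total_weight += weight
        if prediction not in acceptable:
            total_penalty += weight
    return total_penalty / total_weight if total_weight else 0.0


humans = [
    ["cat"] * 5,                            # obvious to every human
    ["dog", "dog", "wolf", "wolf", "fox"],  # genuinely ambiguous
]
print(human_weighted_error(humans, ["dog", "wolf"]))  # wrong on the obvious one
print(human_weighted_error(humans, ["cat", "cat"]))   # wrong on the ambiguous one
```

Getting the obvious example wrong costs more than getting the ambiguous one wrong, which is the behaviour described above.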

1 comment

That would be ideal if money weren't a factor. Since it is, I wonder what the tradeoff is between getting each instance labelled N times and getting N times as many instances labelled.