by IshKebab
2428 days ago
Regarding the types of errors, it seems like the benchmark should be able to take that into account. That is, get a load of humans to do the task on the same specific examples; then for each example you know how hard it is and what the acceptable answers are (I bet a lot of the ground truth is wrong or ambiguous). Then you can benchmark your AI, but penalise it more heavily for getting things wrong that are obvious to a human.
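
To make the idea concrete, here is a rough sketch of what such a scoring scheme might look like. Everything in it is an assumption for illustration: the function names `score_example` and `weighted_error`, the `min_share` threshold for deciding which answers count as acceptable, and the choice to use human agreement on the most common answer as the measure of how "obvious" an example is. Other weightings would be just as defensible.

```python
from collections import Counter


def score_example(human_answers, model_answer, min_share=0.2):
    """Return (penalty, acceptable_answers) for one example.

    human_answers: answers given by human annotators for this example.
    min_share: hypothetical threshold; an answer counts as acceptable
               if at least this share of the humans gave it.
    """
    counts = Counter(human_answers)
    total = len(human_answers)
    acceptable = {a for a, n in counts.items() if n / total >= min_share}
    # Share of humans who gave the most common answer.
    # Close to 1.0 means the example is "obvious to a human".
    agreement = counts.most_common(1)[0][1] / total
    # No penalty for an acceptable answer; otherwise the penalty
    # grows with how obvious the example was to the humans.
    penalty = 0.0 if model_answer in acceptable else agreement
    return penalty, acceptable


def weighted_error(examples):
    """Mean penalty over a list of (human_answers, model_answer) pairs."""
    penalties = [score_example(h, m)[0] for h, m in examples]
    return sum(penalties) / len(penalties)


# Made-up data, purely illustrative.
examples = [
    (["cat"] * 10, "dog"),                           # obvious, model wrong
    (["cat"] * 5 + ["lynx"] * 5, "lynx"),            # ambiguous, answer acceptable
    (["cat"] * 6 + ["lynx"] * 3 + ["fox"], "wolf"),  # harder, model wrong
]
print(weighted_error(examples))
```

With this weighting, a mistake on an example where every human agrees costs the full penalty, a mistake on a contested example costs less, and any answer that a reasonable share of humans also gave costs nothing, which sidesteps wrong or ambiguous ground truth.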