by hangsi
810 days ago
I think the parent comment was referring to something else. In the paper, the tasks are completed only by GPT-4V. For a valid scientific investigation, there should be a control set completed by, e.g., qualified doctors. When the panel of experts does its evaluation, it should rate both sets of responses so that the difference in scores can be compared in the paper.
It assumes that "here are 5 doctors who are always correct," then measures GPT's correctness against them.