by scoresmoke
977 days ago
Thank you! I excluded the coding tasks because most annotators don't have that expertise. I trust them to compare pairs of dissimilar model outputs, which requires no specific skill beyond commonsense reasoning. The only manual analysis I did was checking the passed/failed prompts of the top-performing model.