by scoresmoke 977 days ago
Thank you! I excluded the coding tasks because most annotators don't have that expertise. I trust them to compare pairs of dissimilar model outputs when the comparison requires no specific skill beyond commonsense reasoning.

The only manual analysis I did was checking the passed/failed prompts of the top-performing model.