by maxrmk 988 days ago
This is really cool, nice work. Did you try out any of the grading yourself to compare it to the contractors you used? One thing I've found, especially for coding questions, is that models can produce an answer that _looks_ great but turns out to use libraries or methods that don't exist. Human graders tend to rate these answers highly, since they don't actually run the code.
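To make that concrete, here is a rough sketch of a cheap pre-check a grader could run before rating a Python answer. It only flags imports of modules that aren't installed in the grading environment. The function name `missing_modules` is made up for illustration. It won't catch a nonexistent method on a real library, so actually running the code is still the real test.

```python
import ast
import importlib.util


def missing_modules(source: str) -> list[str]:
    """Return top-level modules imported by `source` that cannot be found.

    Hypothetical helper: it parses the code without executing it and only
    checks that each imported top-level package is importable here.
    """
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> check the top-level package "a"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" -> check "a"; relative imports are skipped
            names.add(node.module.split(".")[0])
    return sorted(n for n in names if importlib.util.find_spec(n) is None)


if __name__ == "__main__":
    answer = "import json\nimport totally_made_up_pkg_xyz\n"
    print(missing_modules(answer))
```

An empty list only means the imports resolve. It says nothing about whether the answer is correct.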

Thank you! I excluded the coding tasks, as most annotators don't have that expertise. I trust them to compare pairs of dissimilar model outputs when the task requires no specific skill beyond commonsense reasoning.

The only manual analysis I did was checking the passed/failed prompts of the top-performing model.