by habitue 985 days ago
Each prompt should be evaluated multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be run several times.
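A rough sketch of what I mean, assuming the grader is just a callable that takes a prompt and returns a numeric score. The names here (`repeated_scores`, `summarize`, `make_noisy_grader`) are made up for illustration, and the noisy grader is a stand-in for a real model call, not anyone's actual API:

```python
import random
import statistics
from typing import Callable, Dict, List


def repeated_scores(grade: Callable[[str], float], prompt: str, n_runs: int = 5) -> List[float]:
    """Call the (hypothetical) grader n_runs times on the same prompt."""
    return [grade(prompt) for _ in range(n_runs)]


def summarize(scores: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range of the repeated scores."""
    return {
        "mean": statistics.mean(scores),
        # Sample stdev needs at least two points; report 0.0 for a single run.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def make_noisy_grader(base: float, noise: float, seed: int = 0) -> Callable[[str], float]:
    """Stand-in for a model-based grader: base score plus uniform random noise."""
    rng = random.Random(seed)

    def grade(prompt: str) -> float:
        return base + rng.uniform(-noise, noise)

    return grade


if __name__ == "__main__":
    # Assumed numbers, purely for demonstration.
    grader = make_noisy_grader(base=7.0, noise=1.5)
    scores = repeated_scores(grader, "Explain quicksort.", n_runs=10)
    print(summarize(scores))
```

If the stdev across runs is comparable to the gap between two prompts' mean scores, a single-run comparison between them probably isn't telling you much.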