by habitue 985 days ago
Should be evaluating each prompt multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be done several times.
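
For what it's worth, a rough sketch of what I mean, in Python. Everything here is an assumption for illustration: `score_spread` and `grade` are hypothetical names, and `grade` stands in for whatever call actually returns a numeric score from the grader model (no real API is used here).

```python
import statistics
from typing import Callable, Dict, List


def score_spread(
    prompts: List[str],
    grade: Callable[[str], float],  # hypothetical: one grader call -> one score
    n_runs: int = 5,
) -> Dict[str, dict]:
    """Grade each prompt n_runs times and report the mean and spread."""
    if n_runs < 2:
        # A sample standard deviation needs at least two scores.
        raise ValueError("n_runs must be at least 2")

    results = {}
    for prompt in prompts:
        # Repeat the same evaluation; a non-deterministic grader will
        # give different scores across runs.
        scores = [grade(prompt) for _ in range(n_runs)]
        results[prompt] = {
            "scores": scores,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),  # sample standard deviation
        }
    return results
```

If the stdev for a prompt is about as large as the gap between the two systems you are comparing, a single grading run doesn't tell you much.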