
I went through this with gemini-1.5, using it to evaluate responses. Almost everything was graded 8-9/10. To get useful results I did the following: 1. Created a long few-shot prompt with many examples of human-graded results. 2. Prompted it to write its review before its assessment. 3. Prompted it to include example quotes to justify its assessment. 4. Had it produce a numeric score last.
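A minimal sketch of those four steps, assuming a plain-text prompt format. The `REVIEW`/`QUOTES`/`SCORE` field names, the example data, and both function names are hypothetical and not taken from the original setup. It only builds the prompt and parses the reply; the model call itself is left out.

```python
import re
from typing import Optional

# Hypothetical human-graded examples: (response, review, quote, score 1-10).
FEW_SHOT = [
    ("The capital of France is Paris.",
     "Correct but minimal; no supporting detail.",
     "The capital of France is Paris.", 5),
    ("Paris, probably? Not sure.",
     "Hedged and unhelpful; gives no justification.",
     "Not sure.", 2),
]

def build_grading_prompt(response: str, examples=FEW_SHOT) -> str:
    """Few-shot prompt that asks for the review and quotes before the score."""
    parts = [
        "Grade the response. Write in this order:",
        "REVIEW: a short written review",
        "QUOTES: exact quotes from the response that justify the review",
        "SCORE: an integer from 1 to 10",
        "",
    ]
    for text, review, quote, score in examples:
        parts += [f"RESPONSE: {text}", f"REVIEW: {review}",
                  f'QUOTES: "{quote}"', f"SCORE: {score}", ""]
    # End on an open REVIEW field so the model writes the review first.
    parts += [f"RESPONSE: {response}", "REVIEW:"]
    return "\n".join(parts)

def parse_score(model_output: str) -> Optional[int]:
    """Return the last 'SCORE: n' value in the output, or None if absent or out of range."""
    matches = re.findall(r"^SCORE:\s*(\d{1,2})\s*$", model_output, flags=re.MULTILINE)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None
```

Putting the score last means the number is generated after the written review and quotes, so it can be conditioned on them rather than the other way round.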

With gemini-2 I've been able to get similar results without the few-shot prompts, simply by prompting it not to be a sycophant, explaining why it was important to get realistic, even harsh scores, and saying that I expected most scores to be low, in order for the high-scoring content to stand out.
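The instruction-only approach might look something like this. The wording is a guess at the kind of preamble described, not the actual prompt, and `with_strict_preamble` is a hypothetical helper.

```python
# Hypothetical wording; the original prompt is not shown in the comment.
STRICT_PREAMBLE = (
    "Do not be sycophantic. Realistic, even harsh, scores are needed so that "
    "genuinely strong content stands out. Expect most scores to be low."
)

def with_strict_preamble(task_prompt: str) -> str:
    """Prepend the strictness instructions to any grading prompt."""
    return f"{STRICT_PREAMBLE}\n\n{task_prompt}"
```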

In a recent test, I switched to word scores: low, medium, high, and very high. Out of about 500 examples, none scored very high. I thought that was pretty cool, as when I do find one scoring high it will stand out, and hopefully justify its score.
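A rough sketch of checking the label distribution over a batch, assuming each model output ends with a line like `SCORE: medium`. That output format and the function names are assumptions, not part of the original test.

```python
from collections import Counter
from typing import Iterable, Optional

LABELS = ("low", "medium", "high", "very high")

def parse_label(model_output: str) -> Optional[str]:
    """Return the label on the last 'SCORE:' line, or None if missing or unrecognized."""
    for line in reversed(model_output.splitlines()):
        if line.strip().upper().startswith("SCORE:"):
            value = line.split(":", 1)[1].strip().lower().rstrip(".")
            return value if value in LABELS else None
    return None

def tally(outputs: Iterable[str]) -> Counter:
    """Count how often each label (or None for unparseable output) appears."""
    return Counter(parse_label(o) for o in outputs)
```

A tally like this makes it easy to confirm that the top label stays rare across a run, and to spot outputs that did not parse at all.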