Hacker News
by gillesjacobs 818 days ago
The yes-or-no reference answer test is a really bad way to go about this. Maybe take a page out of the RAGAS evaluation templates and use an LLM to iteratively summarise the nuanced category.