by gsandahl 310 days ago
Most of the tasks are assessed against ground truth, occasionally with the help of an LLM as a judge when the answer is a sentence rather than an exact result.

Example: given a long travel journal, "How many cities does the author mention?" GPT-5: 12. Expected: 17.
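For anyone curious what that kind of grading looks like, here is a rough sketch, not the actual harness. It assumes exact-match comparison after light normalization, with a fallback to a judge only when the expected answer is more than one word. The names `normalize`, `grade`, and the `judge` callable are hypothetical; in practice `judge` would wrap an LLM call.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.strip().lower().split())


def grade(
    predicted: str,
    expected: str,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Return True if the prediction is accepted.

    Exact results are compared against ground truth directly. The judge
    (hypothetical, e.g. an LLM wrapper) is consulted only when the expected
    answer looks like a sentence, i.e. has more than one word.
    """
    if normalize(predicted) == normalize(expected):
        return True
    if judge is not None and len(expected.split()) > 1:
        return judge(predicted, expected)
    return False


# The travel-journal example: an exact numeric answer, so no judge is involved.
print(grade("12", "17"))  # False
```

With this split, a count like 12 vs. 17 can never be rescued by a lenient judge, which is the point of keeping ground truth as the primary check.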