Hacker News
by deckar01 864 days ago
"Evaluate" here refers to measuring a model's accuracy on a standard dataset so that models can be compared. AKA a benchmark.

https://rentruewang.github.io/bocoel/research/
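To make "accuracy on a standard dataset" concrete, here is a minimal sketch in Python. It is not taken from the linked project: the multiple-choice item format, the `accuracy` helper, and the toy `longest_choice_model` are all hypothetical stand-ins for a real benchmark and a real model.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item format: (question, choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]


def accuracy(model: Callable[[str, Sequence[str]], int],
             dataset: Sequence[Item]) -> float:
    """Return the fraction of items where the model picks the labeled choice."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(model(question, choices) == answer
                  for question, choices, answer in dataset)
    return correct / len(dataset)


def longest_choice_model(question: str, choices: Sequence[str]) -> int:
    """Toy stand-in for a real model: always picks the longest choice."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))


# Made-up items for illustration only, not a real benchmark.
toy_dataset = [
    ("2 + 2 = ?", ["3", "4", "22"], 1),
    ("Capital of France?", ["Paris", "Rome"], 0),
    ("Largest planet?", ["Mars", "Jupiter", "Venus"], 1),
]

score = accuracy(longest_choice_model, toy_dataset)
```

Because every model is scored against the same fixed items and labels, the resulting numbers are comparable across models, which is the point of a benchmark.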


Right, I guess I am not familiar with how automated benchmarks for LLMs work. I assumed that deciding whether an LLM answer was good required human evaluation.

Multiple-choice tests, LM eval (e.g. have GPT-4 rate an answer, or use M-of-N GPT-4 ratings as pass/fail), and perplexity (i.e. how well the model predicts the tokens of a reference corpus, typically held-out text rather than its training data).

There are lots of ways to evaluate without humans. Nearly all LLM benchmarks are scored fully automatically, with no human in the loop.
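Two of the methods mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `judge` is a hypothetical callable standing in for a call to a rating model such as GPT-4, and `token_log_probs` is assumed to hold the natural-log probabilities a model assigned to each token of the evaluation text. Neither corresponds to a specific library's API.

```python
import math
from typing import Callable, Sequence


def passes_m_of_n(judge: Callable[[str, str], bool],
                  question: str, answer: str, m: int, n: int) -> bool:
    """Query the judge n times; pass if at least m verdicts are positive."""
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    votes = sum(bool(judge(question, answer)) for _ in range(n))
    return votes >= m


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Scripted verdicts stand in for three calls to a real judge model.
scripted = iter([True, False, True])
passed = passes_m_of_n(lambda q, a: next(scripted), "question", "answer", m=2, n=3)

# A model that assigns probability 0.25 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
```

Lower perplexity means the model found the text less surprising. Repeating the judge call n times is meant to smooth out run-to-run variation in a sampled rating.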