by MacsHeadroom
856 days ago
Multiple-choice tests, LM Eval (e.g. have GPT-4 rate an answer, or use M-of-N GPT-4 ratings as pass/fail), perplexity (i.e. how well it predicts the tokens of a reference corpus, usually held-out text rather than the data it was trained on). There are lots of ways to evaluate without humans. Most (nearly all) LLM benchmarks are fully automated, with no humans involved.
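To make that concrete, here is a rough sketch of how the three scores could be computed once you already have the model outputs. The function names and input shapes are my own assumptions for illustration, not any particular benchmark's API, and the judge ratings are assumed to already be collected as booleans.

```python
import math
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the answer key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def m_of_n_pass(ratings: Sequence[bool], m: int) -> bool:
    """Pass/fail from N judge ratings: pass if at least m of them are passes.

    Each rating is assumed to be one judge-model verdict (e.g. one GPT-4 call).
    """
    return sum(ratings) >= m


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Computed as exp of the mean negative log-likelihood; lower is better.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    print(multiple_choice_accuracy(["A", "B", "C"], ["A", "B", "D"]))
    print(m_of_n_pass([True, False, True], m=2))
    print(perplexity([math.log(0.25)] * 4))
```

None of this needs a human in the loop: the answer key, the judge verdicts, and the token log-probabilities all come from files or model calls.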