Multiple-choice tests, LM eval (e.g. have GPT-4 rate an answer, or use M-of-N GPT-4 ratings as pass/fail), perplexity (i.e. how well the model predicts the tokens of a reference corpus, usually held-out text rather than its training data).
Lots of ways to evaluate without humans. Most (nearly all) LLM benchmarks are fully automated, without any humans involved.
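To make two of these concrete, here is a rough Python sketch. It assumes you already have per-token natural-log probabilities from the model and a list of boolean pass/fail votes from the judge model. The function names `perplexity` and `m_of_n_pass` are made up for illustration and don't come from any particular eval library.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Computed as exp of the negative mean log-probability.
    Lower is better; 1.0 means every token was predicted with certainty.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def m_of_n_pass(votes: Sequence[bool], m: int) -> bool:
    """Pass if at least m of the n judge votes are positive.

    How each vote is obtained (e.g. one judge-model call per vote)
    is left out here and is an assumption of this sketch.
    """
    return sum(votes) >= m
```

For example, a model that assigns probability 0.25 to every token has a perplexity of 4, and `m_of_n_pass([True, False, True], m=2)` passes because two of the three ratings are positive.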