The most widely used benchmarks for evaluating LLMs
1 point by kavaivaleri 801 days ago
Commonsense Reasoning
- HellaSwag
- Winogrande
- PIQA
- SIQA
- OpenBookQA
- ARC
- CommonsenseQA

Logical Reasoning
- MMLU
- BIG-Bench Hard (BBH)

Mathematical Reasoning
- GSM-8K
- MATH
- MGSM
- DROP

Code Generation
- HumanEval
- MBPP

World Knowledge & QA
- NaturalQuestions
- TriviaQA
- MMMU
- TruthfulQA

I collected their descriptions and links to their original papers here: https://www.turingpost.com/p/llm-benchmarks