Hacker News
The most widely used benchmarks for evaluating LLMs
1 point by kavaivaleri 801 days ago
Commonsense Reasoning: HellaSwag, WinoGrande, PIQA, SIQA, OpenBookQA, ARC, CommonsenseQA

Logical Reasoning: MMLU, BBH (BIG-Bench Hard)

Mathematical Reasoning: GSM8K, MATH, MGSM, DROP

Code Generation: HumanEval, MBPP

World Knowledge & QA: NaturalQuestions, TriviaQA, MMMU, TruthfulQA

I collected their descriptions and links to their original papers here: https://www.turingpost.com/p/llm-benchmarks

1 comment

I've never been able to open a Turing Post link; they all give an SSL error...