|
Did you even look at the article?
Evaluation Benchmarks
Our evaluation encompasses three primary categories of benchmarks, each designed to assess distinct capabilities of the model:
• Language Understanding and Reasoning: Hellaswag [121], ARC-Challenge [14], Winogrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].
• Code Generation: LiveCodeBench v6 [44], EvalPlus [60].
• Math & Reasoning: AIME 2025, MATH 500, HMMT 2025, PolyMath-en.
• Long-context: MRCR, RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13], and LongBench v2 [6].
• Chinese Language Understanding and Reasoning: C-Eval [43] and CMMLU [55]. |