Here are some benchmarks. Excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5!

AI2 Reasoning Challenge (25-shot): a set of grade-school science questions.
- Llama 1 (llama-65b): 57.6
- Llama 2 (llama-2-70b-chat-hf): 64.6
- GPT-3.5: 85.2
- GPT-4: 96.3

HellaSwag (10-shot): a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- Llama 1: 84.3
- Llama 2: 85.9
- GPT-3.5: 85.3
- GPT-4: 95.3

MMLU (5-shot): a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- Llama 1: 63.4
- Llama 2: 63.9
- GPT-3.5: 70.0
- GPT-4: 86.4

TruthfulQA (0-shot): a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is at minimum a 6-shot task, since 6 examples are always prepended, even when it is launched with 0 as the number of few-shot examples.
- Llama 1: 43.0
- Llama 2: 52.8
- GPT-3.5: 47.0
- GPT-4: 59.0

[0] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
[1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
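
To make the "approaching, and in some areas surpassing" claim easy to check, here is a rough Python sketch that takes the scores quoted above and computes the Llama 2 minus GPT-3.5 gap per benchmark. The helper names `gap` and `wins` are made up for illustration, and the numbers are only the ones listed in this comment, not fetched from the leaderboard.

```python
# Scores copied from the comment above (percent accuracy per benchmark).
scores = {
    "ARC (25-shot)": {"Llama 1": 57.6, "Llama 2": 64.6, "GPT-3.5": 85.2, "GPT-4": 96.3},
    "HellaSwag (10-shot)": {"Llama 1": 84.3, "Llama 2": 85.9, "GPT-3.5": 85.3, "GPT-4": 95.3},
    "MMLU (5-shot)": {"Llama 1": 63.4, "Llama 2": 63.9, "GPT-3.5": 70.0, "GPT-4": 86.4},
    "TruthfulQA (0-shot)": {"Llama 1": 43.0, "Llama 2": 52.8, "GPT-3.5": 47.0, "GPT-4": 59.0},
}


def gap(benchmark, model="Llama 2", baseline="GPT-3.5"):
    """Return the model's score minus the baseline's, rounded to one decimal."""
    row = scores[benchmark]
    return round(row[model] - row[baseline], 1)


def wins(model="Llama 2", baseline="GPT-3.5"):
    """List the benchmarks where the model scores above the baseline."""
    return [name for name in scores if gap(name, model, baseline) > 0]


if __name__ == "__main__":
    for name in scores:
        print(f"{name}: {gap(name):+.1f}")
    print("Llama 2 ahead on:", ", ".join(wins()))
```

On these numbers Llama 2 comes out ahead of GPT-3.5 on HellaSwag and TruthfulQA, and behind on ARC and MMLU.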