Open-source, sanitized evaluation datasets for models that reason and code

When training our 70B model, we wanted to accurately evaluate models' natural language understanding and reasoning abilities. Surprisingly, we found that both open and closed models achieve nearly 100% accuracy when evaluated only on unambiguous questions. We cleaned evaluation datasets to isolate true failures of reasoning from failures due to ambiguous or low-quality questions, and have open-sourced many of them. These include:

• 11 sanitized and extended NLP reasoning benchmarks including ARC, GSM8K, HellaSwag, and Social IQa
• An original code-focused reasoning benchmark
• A new dataset of 450,000 human judgments about ambiguity in NLP questions