Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:

* run LLM evaluations systematically and at scale
* share the data with the public in a rigorous and transparent way

We use the UK government's Inspect [1] library to run the evaluations.

As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy and measure the consistency of the LLM's answers. The 1,324*8=10,592 samples represent 8.5M tokens (2M in, 6.5M out).

You can see the full transcripts here in Inspect's interactive interface: https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=htt...

Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.

It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.

[0]: https://epoch.ai/
[1]: https://github.com/UKGovernmentBEIS/inspect_ai
[2]: https://arxiv.org/abs/2103.03874
[3]: https://huggingface.co/datasets/TIGER-Lab/MathInstruct
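
For readers wondering how repeated sampling feeds into a mean accuracy and an error bar, here is a minimal sketch. It is not Epoch AI's actual pipeline, and it is an assumption that the reported ± is a standard error computed this way. It assumes each question has the same number of 0/1 scores, averages them per question, and then treats questions (not individual samples) as the independent units. The function name and toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def accuracy_with_clustered_se(scores_by_question):
    """Return (mean accuracy, standard error) from repeated samples.

    scores_by_question: one inner list per question, holding a 0/1 score
    for each sample of that question (assumed: same length for all).
    """
    # Average the repeated samples within each question first.
    per_question = [mean(scores) for scores in scores_by_question]
    accuracy = mean(per_question)
    # Samples of the same question are correlated, so the standard error
    # is taken over per-question means rather than over all samples.
    std_error = stdev(per_question) / math.sqrt(len(per_question))
    return accuracy, std_error


# Toy data: 4 questions, 8 samples each (purely illustrative).
toy_scores = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # always solved
    [0, 0, 0, 0, 0, 0, 0, 0],  # never solved
    [1, 0, 1, 1, 0, 1, 1, 0],  # inconsistent
    [0, 0, 1, 0, 0, 0, 0, 0],  # rarely solved
]
accuracy, std_error = accuracy_with_clustered_se(toy_scores)
print(f"accuracy = {accuracy:.4f}, standard error = {std_error:.4f}")
```

The per-question means also give a rough view of consistency: values near 0 or 1 mean the model answers that question the same way every time, while values in between mean its answers vary across samples.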
One question I have regarding evals is: what sampling temperature and/or method do you use? As far as I understand, temperature and sampling method can impact model output a lot. Would love to hear your thoughts on how different settings for the same model can impact output, and how to go about evaluating models when it's not clear how to use them to their fullest.
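
To make the temperature part of the question concrete, here is a small sketch of the standard temperature-scaled softmax over next-token logits. The logits are made up for illustration, and nothing here reflects the settings actually used in the evaluation above.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.0]
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, so repeated samples agree more often. Higher temperatures flatten the distribution, so samples diverge more.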