Hacker News new | ask | show | jobs
by tadamcz 502 days ago
Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:

* run LLM evaluations systematically and at scale

* share the data with the public in a rigorous and transparent way

We use the UK government's Inspect [1] library to run the evaluations.

As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy, and measure the consistency of the LLM's answers. The 1,324*8=10,584 samples represent 8.5M tokens (2M in, 6.5M out).

You can see the full transcripts here in Inspect’s interactive interface: https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=htt...

Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.

It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.

[0]: https://epoch.ai/

[1]: https://github.com/UKGovernmentBEIS/inspect_ai

[2]: https://arxiv.org/abs/2103.03874

[3]: https://huggingface.co/datasets/TIGER-Lab/MathInstruct

1 comments

Thanks a lot for this eval!

One question i have regarding evals is, what sampling temperature and/or method do you use? As far as i understand temperature/ method can impact model output alot. Would love to here you're thoughts on how these different settings of the same model can impact output and how to go about evaluating models when its not clear how to use the to their fullest

Generally, we'll use the API provider's defaults.

For models we run ourselves from the weights, at the moment we'd use vLLM's defaults, but this may warrant more thought and adjustment. Other things being equal, I prefer to use an AI lab's API, with settings as vanilla as possible, so that we essentially defer to them on these judgments. For example, this is why we ran this Mistral model from Mistral's API instead of from the weights.

I believe the `temperature` parameter, for example, has different implementations across architectures/models, so it's not as simple as picking a single temperature number for all models.

However, I'm curious if you have further thoughts on how we should approach this.

By the way, in the log viewer UI, for any model call, you can click on the "API" button to see the payloads that were sent. In this case, you can see that we do not send any values to Mistral for `top_p`, `temperature`, etc.