HN Mirror

by pinko 253 days ago
From https://lastexam.ai/: "The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting." [emphasis mine] While the private questions don't seem to be included in the performance results, HLE will presumably flag any LLM that appears to have gamed its scores based on the differential performance on the private questions. Since they haven't yet, I think the scores are relatively trustworthy.
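To make the "differential performance" idea concrete, here is a minimal sketch of how such a check could work. This is not HLE's actual procedure, which the quoted text does not describe. The function names, the threshold, and all counts are hypothetical, and the size of the private set is not stated in the quote. The sketch compares accuracy on the public questions against accuracy on the held-out ones with a standard two-proportion z-test.

```python
import math


def overfit_gap(public_correct, public_total, private_correct, private_total):
    """Return (gap, z): public-minus-private accuracy and its z statistic.

    Uses a pooled two-proportion z-test. All inputs are plain counts.
    """
    p_public = public_correct / public_total
    p_private = private_correct / private_total
    pooled = (public_correct + private_correct) / (public_total + private_total)
    # Standard error of the difference under the null of equal accuracy.
    se = math.sqrt(pooled * (1 - pooled) * (1 / public_total + 1 / private_total))
    gap = p_public - p_private
    z = gap / se if se > 0 else 0.0
    return gap, z


def looks_gamed(public_correct, public_total, private_correct, private_total,
                z_threshold=2.0):
    """Flag a model whose public score is suspiciously above its private score.

    z_threshold is an arbitrary illustrative cutoff, not a value from HLE.
    """
    _, z = overfit_gap(public_correct, public_total, private_correct, private_total)
    return z > z_threshold


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(looks_gamed(500, 1000, 250, 500))  # same accuracy on both sets
    print(looks_gamed(900, 1000, 200, 500))  # large public/private gap
```

A check like this only catches overfitting to the public questions. It says nothing about whether the private questions themselves leaked, which is the concern raised in the replies.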

4 comments

panarky 253 days ago

The jump in ARC-AGI and MathArena suggests Google has solved the data scarcity problem for reasoning, maybe with synthetic-data self-play?

This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.

If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation) and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.

largbae 253 days ago

How do they hold back questions in practice though? These are hosted models. To ask the question is to reveal it to the model team.

Bombthecat 253 days ago

They pinky swear not to store or use the prompts and data lol

UltraSane 253 days ago

A legally binding pinky swear LOL

riku_iki 253 days ago

With fine print somewhere on page 67 saying that there are exceptions.

ashdksnndck 252 days ago

Who needs fine print when there is an SRE with access to the servers who is friends with a research director who gets paid more if the score goes up?

UltraSane 253 days ago

You have to trust that the LLM provider isn't copying the questions when Humanity's Last Exam runs the test.

mapt 252 days ago

There are only eleventy trillion dollars shifting around based on the results, so nobody has any reason to lie.

rvnx 253 days ago

Seems difficult to believe, considering the number of people who prepared this dataset and who also work(ed) at, or hold shares in, Google, OpenAI, etc.

menaerus 251 days ago

So everybody is cheating in your mind? We can't trust anything? How about a more balanced take: there's certainly some progress, and while the benchmark results most likely don't reflect real-world performance, the progress is continuous.
