HN Mirror

by rvnx 210 days ago

This is a list of questions and answers that was created by different people.

The questions AND the answers are public.

If an LLM manages, through reasoning OR memory, to repeat back the answer, then it wins.

The scores represent the % of answers each LLM repeated back correctly.

1 comment

tylervigen 210 days ago

That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.

You could question how well this works, but it’s not like the answers are just hanging out on the public internet.

slaterbug 210 days ago

Excuse my ignorance, but how do these companies evaluate their models against the evaluation set without access to it?

ricopags 210 days ago

Cooperation with the eval admins
