by jayd16 56 days ago

Let's just base benchmarks on bounty rankings. To bench a model, you have it look at PRs on some open source projects. It has to complete a novel task or improve on a previous one; there are no points for just redoing a task that already has a PR. We rank the tasks by difficulty for the benchmark after the fact, once they're completed.
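Roughly, the scoring rule could look like the sketch below. Everything in it is made up for illustration: the `Submission` record, the `kind` labels, and the numeric `difficulty` are assumptions, and nothing here says how difficulty would actually be decided, only that it gets filled in after the task is done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical record of one model's PR against an OSS task."""
    model: str
    task_id: str
    kind: str          # assumed labels: "novel", "improvement", or "redo"
    difficulty: float  # assumed to be assigned after the fact


def score(submissions):
    """Sum difficulty per model, giving nothing for redone work."""
    totals = {}
    claimed = set()  # tasks that already have a completed novel PR
    for s in submissions:
        totals.setdefault(s.model, 0.0)
        if s.kind == "redo":
            continue  # no points for repeating a task with an existing PR
        if s.kind == "novel":
            if s.task_id in claimed:
                continue  # someone got there first, so this counts as a redo
            claimed.add(s.task_id)
        totals[s.model] += s.difficulty
    return totals


def leaderboard(submissions):
    """Rank models by total score, highest first."""
    return sorted(score(submissions).items(), key=lambda kv: kv[1], reverse=True)
```

Submissions are assumed to arrive in chronological order, so the first novel PR on a task is the one that counts and improvements are scored on their own difficulty.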

If an AI company wants to show off, it'll have to crush some OSS PRs. If another company wants to say their model remains supreme, it'll have to complete other tasks that were left on the table.

Of course, you would only bother the OSS project with new PRs once you were actually not embarrassed by what your model did.

In this way, rankings are created from jolly combat and one-upmanship, and we get some OSS work done.

(mostly joking but it would be a fun way to do things)