
by semi-extrinsic 479 days ago

So what we need is something like a versioned crowdsourced coding LLM eval dataset.

Every quarter, a couple thousand volunteers each provide 2 GitHub issues from the past 3 months that are nontrivial to resolve and for which strong test cases exist. Each volunteer then cross-checks 2 issues from other volunteers. In return, the volunteers get a one-month free subscription to some AI service.

This dataset is then published as SWE-UberBench-2025-02 or something. People can then evaluate their coding LLM only on datasets published after their training period.
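To make the "only evaluate on datasets published after your training period" rule concrete, here is a rough Python sketch. The release list, the function names, and the `strict` option are all made up for illustration; the only thing taken from the comment is the `SWE-UberBench-YYYY-MM` naming and the 3-month issue window. The `strict` mode is an extra assumption: since a release contains issues from the 3 months before publication, it also requires the issue window to start after the training cutoff.

```python
from datetime import date

# Hypothetical releases, named after the "SWE-UberBench-YYYY-MM" pattern above.
RELEASES = [
    "SWE-UberBench-2024-11",
    "SWE-UberBench-2025-02",
    "SWE-UberBench-2025-05",
]

ISSUE_WINDOW_MONTHS = 3  # each release draws issues from the past 3 months


def release_date(name: str) -> date:
    """Parse the trailing YYYY-MM of a release name (first day of that month)."""
    year, month = name.rsplit("-", 2)[-2:]
    return date(int(year), int(month), 1)


def window_start(name: str) -> date:
    """Earliest date an issue in this release could come from."""
    published = release_date(name)
    months = published.year * 12 + (published.month - 1) - ISSUE_WINDOW_MONTHS
    return date(months // 12, months % 12 + 1, 1)


def eligible_releases(releases, training_cutoff: date, strict: bool = False):
    """Releases a model with the given training cutoff may be evaluated on.

    strict=False: the release was published after the cutoff.
    strict=True:  every issue in the release postdates the cutoff.
    """
    boundary = window_start if strict else release_date
    return [r for r in releases if boundary(r) > training_cutoff]


if __name__ == "__main__":
    cutoff = date(2024, 12, 15)
    print(eligible_releases(RELEASES, cutoff))
    print(eligible_releases(RELEASES, cutoff, strict=True))
```

With a cutoff of 2024-12-15, the lenient rule admits the 2025-02 and 2025-05 releases, while the strict rule admits only 2025-05, because the 2025-02 release could contain issues from as early as November 2024.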

3 comments

delusional 479 days ago

And why would these "couple of thousand volunteers" help with this?


rsynnott 479 days ago

And how would you ensure that all of them were really volunteers and not colluding with the vendors? Like, tech companies cheating on benchmarks is an old, old story (personal favourite: in the dark ages, before 3D acceleration, some graphics card drivers, on detecting a 2D acceleration benchmark, would _simply draw the wrong thing_), and I wouldn’t trust at least three of the major players as far as I could throw them.


delusional 479 days ago

I'm pretty sure my BIOS still contains an option to "improve performance of 3dmark 8" or something similar.


nitwit005 479 days ago

If you know some way to get people to volunteer millions of dollars of free labor, there are better uses of their time than evaluating LLMs.


SR2Z 479 days ago

Right, so that AI companies can freely throw this significantly more valuable training data into a model and then turn around and advocate for clamping down on the freedom of models.
