Hacker News
by semi-extrinsic 479 days ago
So what we need is something like a versioned crowdsourced coding LLM eval dataset.

Every quarter, you have a couple thousand volunteers each provide 2 GitHub issues from the past 3 months that are nontrivial to resolve and that have strong test cases. Each volunteer then cross-checks 2 issues from other volunteers. In return, the volunteers get a free 1-month subscription to some AI service.
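As a rough sketch of the cross-checking step (not a worked-out design), one way to hand each volunteer 2 issues from other volunteers is a shuffled rotation. The function name `assign_cross_checks`, the dict layout, and the `seed` parameter are all made up for illustration; the only things taken from the proposal are "2 issues per volunteer" and "cross-check 2 issues from others".

```python
import random


def assign_cross_checks(submissions, seed=0):
    """Assign each volunteer 2 issues submitted by other volunteers.

    submissions: dict mapping volunteer name -> list of exactly 2 issue IDs
                 (hypothetical layout, assumed for this sketch).
    Returns a dict mapping volunteer name -> list of 2 issue IDs to review.
    """
    volunteers = sorted(submissions)
    n = len(volunteers)
    if n < 3:
        # With fewer than 3 people the rotation below would wrap onto oneself.
        raise ValueError("need at least 3 volunteers")
    if any(len(issues) != 2 for issues in submissions.values()):
        raise ValueError("each volunteer must submit exactly 2 issues")

    # Shuffle so reviewers are not predictable from the name ordering.
    random.Random(seed).shuffle(volunteers)

    assignments = {}
    for i, reviewer in enumerate(volunteers):
        # Take the first issue of the next volunteer and the second issue
        # of the one after that, so every issue is reviewed exactly once.
        first = submissions[volunteers[(i + 1) % n]][0]
        second = submissions[volunteers[(i + 2) % n]][1]
        assignments[reviewer] = [first, second]
    return assignments
```

With this rotation nobody reviews their own issues and every submitted issue gets exactly one reviewer. It does nothing on its own to detect colluding reviewers.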

This dataset is then published as SWE-UberBench-2025-02 or something. People can then evaluate their coding LLM only on datasets published after its training period.
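The "only datasets published after training" rule could be checked mechanically. A minimal sketch, assuming each release has a known publication date and each model a declared training cutoff; `eligible_releases` and the dates in the usage example are hypothetical, and only the name SWE-UberBench-2025-02 comes from the idea above.

```python
from datetime import date


def eligible_releases(releases, training_cutoff):
    """Return the names of releases a model may be evaluated on.

    releases: dict mapping release name -> publication date (assumed layout).
    training_cutoff: last date covered by the model's training data.
    A release is eligible only if it was published strictly after the cutoff.
    """
    return sorted(
        name for name, published in releases.items()
        if published > training_cutoff
    )


# Usage with made-up publication dates:
releases = {
    "SWE-UberBench-2024-11": date(2024, 11, 1),
    "SWE-UberBench-2025-02": date(2025, 2, 1),
}
print(eligible_releases(releases, date(2024, 12, 15)))
```

This relies on the vendor reporting the training cutoff honestly, which the scheme itself cannot verify.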

3 comments

And why would these "couple of thousand volunteers" help with this?
And how would you ensure that all of them were really volunteers and not colluding with the vendors? Like, tech companies cheating on benchmarks is an old, old story (personal favourite: in the dark ages, before 3D acceleration, some graphics card drivers, on detecting a 2D acceleration benchmark, would _simply draw the wrong thing_), and I wouldn’t trust at least three of the major players as far as I could throw them.
I'm pretty sure my BIOS still contains an option to "improve performance of 3dmark 8" or something similar.
If you know some way to get people to volunteer millions of dollars of free labor, there are better uses of their time than evaluating LLMs.
Right, so that AI companies can freely throw this significantly more valuable training data into a model and then turn around and advocate for clamping down on the freedom of models.