Hacker News
by sebzim4500 584 days ago
>Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

Or they know the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.


Although the answer isn't sent, so it would have to be a very deliberate effort to fish those out of the API chatter and find the right domain expert with 4-10 hours to spend on cracking it.

Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.

Why do people still insist that this is unlikely? Like assuming that the company that paid 15M for chat.com does not have some spare change to pay some graduate students/postdocs to solve some math problems. The publicity of solving such a benchmark would definitely raise the valuation, so it would 100% be worth it for them...
Any benchmark which isn't dynamically generated is useless for that very reason.
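To make "dynamically generated" concrete, here is a minimal sketch, assuming a toy question template (modular arithmetic) that has nothing to do with the actual benchmark. The names `make_item` and `make_benchmark` are hypothetical. The point is only that each evaluation run can draw fresh questions from a private seed, so memorizing questions that were sent over the API earlier does not help.

```python
import random

def make_item(rng):
    """Build one question/answer pair from randomly drawn parameters."""
    a = rng.randint(100, 999)
    b = rng.randint(100, 999)
    m = rng.randint(7, 97)
    question = f"What is ({a} * {b}) mod {m}?"
    answer = (a * b) % m  # ground truth is computed, never stored in a fixed test file
    return question, answer

def make_benchmark(seed, n=5):
    """Generate n items from a seed; a new secret seed gives a new test set."""
    rng = random.Random(seed)
    return [make_item(rng) for _ in range(n)]

if __name__ == "__main__":
    for question, answer in make_benchmark(seed=2024):
        print(question, "->", answer)
```

The same seed reproduces the same items, so results stay auditable after the fact. Whether research-level math problems can be templated like this at all is a separate question.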
Simple: I highly doubt they're willing to risk a scandal that would further tarnish their brand. It's still reeling from last year's drama, in addition to a spate of high-profile departures this year. Not to mention a few articles with insider sources that aren't exactly flattering.
I doubt it would be seen as a scandal. They can simply generate training data for these questions just like they do for other problems. The only difference is probably that the pay rate is much higher for this kind of training data than for most other areas.
You’re not thinking about the other side of the equation. If they win (becoming the first to excel at the benchmark), they potentially make billions. If they lose, they’ll be relegated to the dustbin of LLM history. Since there is an existential threat to the brand, there is almost nothing that isn’t worth risking to win. Risking a scandal to avoid irrelevance is an easy asymmetrical bet. Of course they would take the risk.
Okay, let's assume what you say ends up being true. They effectively cheat, then raise some large fundraising round predicated on those results.

Two months later there's a bombshell exposé detailing insider reports of how they cheated the test by cooking their training data using an army of PhDs to hand-solve. Shame.

At a minimum investor confidence goes down the drain, if it doesn't trigger lawsuits from their investors. Then you're looking at maybe another CEO ouster fiasco with a crisis of faith across their workforce. That workforce might be loyal now, but that's because their RSUs are worth something and not tainted by fraud allegations.

If you're right, I suppose it really depends on how well they could hide it via layers of indirection and compartmentalization, and how hard they could spin it. I don't really have high hopes for that given the number of folks there talking to the press lately.

Parallel construction

Doesn't cause too much scandal lol

Of course lol. How come e.g. o1 scores so high on these reasoning and math and IMO benchmarks and then fails every simple question I ask of it? The answer is training on the test set.