by tux3 584 days ago
Although the answer isn't sent, so it would have to be a very deliberate effort to fish those out of the API chatter and find the right domain expert with 4-10 hours to spend on cracking it.

Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.


Why do people still insist that this is unlikely? It's like assuming that the company that paid 15M for chat.com does not have some spare change to pay some graduate students/postdocs to solve some math problems. The publicity of solving such a benchmark would definitely raise the valuation, so it would 100% be worth it for them...
Any benchmark which isn't dynamically generated is useless for that very reason.
Simple: I highly doubt they're willing to risk a scandal that would further tarnish their brand. It's still reeling from last year's drama, in addition to a spate of high-profile departures this year. Not to mention a few articles with insider sources that aren't exactly flattering.
I doubt it would be seen as a scandal. They can simply generate training data for these questions just like they do for other problems. The only difference is that the pay rate is probably much higher for this kind of training data than for most other areas.
You’re not thinking about the other side of the equation. If they win (becoming the first to excel at the benchmark), they potentially make billions. If they lose, they’ll be relegated to the dustbin of LLM history. Since there is an existential threat to the brand, there is almost nothing that isn’t worth risking to win. Risking a scandal to avoid irrelevance is an easy asymmetrical bet. Of course they would take the risk.
Okay, let's assume what you say ends up being true. They effectively cheat, then raise some large fundraising round predicated on those results.

Two months later there's a bombshell exposé detailing insider reports of how they cheated the test by cooking their training data using an army of PhDs to hand-solve. Shame.

At a minimum investor confidence goes down the drain, if it doesn't trigger lawsuits from their investors. Then you're looking at maybe another CEO ouster fiasco with a crisis of faith across their workforce. That workforce might be loyal now, but that's because their RSUs are worth something and not tainted by fraud allegations.

If you're right, I suppose it really depends on how well they could hide it via layers of indirection and compartmentalization, and how hard they could spin it. I don't really have high hopes for that given the number of folks there talking to the press lately.

Parallel construction

Doesn't cause too much scandal lol