by lufenialif2 329 days ago
Still no information on the amount of compute needed; would be interested to see a breakdown from Google or OpenAI on what it took to achieve this feat.

Something that was hotly debated in the thread with OpenAI's results:

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."

It seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.

5 comments

Ok, but when this is reported by mass media, which never uses SI units and instead uses units like Libraries of Congress or elephants, what kind of unit should the media use to compare the computational energy of AI vs. children?
If the models that got a gold medal are anything like those used on ARC-AGI, then you can bet they wrote an insane amount of text trying to reason their way through these problems. Like, several bookshelves' worth of writing.

So funnily enough, "the AI wrote x times the Library of Congress to get there" is a good enough comparison.

Dollars of compute at market rate is what I'd like to see, to check whether calling this tool would cost $100 or $100,000.
4.5 hours × 2 "days", 100 watts including support system.

I'm not sure how to implement the "no calculator" rule :) but for this kind of problem it's not critical.

Total = 900 Wh = 3.24 MJ

100 watts seems very low. A single Nvidia GeForce RTX 5090 is rated at ~600 watts. Probably they are using many GPUs/TPUs in parallel.
I forgot to explain in my comment, but my calculation is for humans.

If the computer uses ~600W, let's give it 45+45 minutes and we are even :) If they want to use many GPU ...
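The back-of-the-envelope numbers in this subthread can be checked with a short sketch. The 100 W and ~600 W figures are the commenters' own rough guesses, not measurements, and the function names here are made up for illustration.

```python
# Rough figures quoted in the comments above; assumptions, not measurements.
HUMAN_POWER_W = 100.0   # contestant "including support system"
EXAM_HOURS = 4.5 * 2    # two 4.5-hour sessions
GPU_POWER_W = 600.0     # approximate rating quoted for one consumer GPU


def energy_wh(power_w: float, hours: float) -> float:
    """Energy in watt-hours for a constant power draw."""
    return power_w * hours


def wh_to_mj(wh: float) -> float:
    """Convert watt-hours to megajoules (1 Wh = 3600 J)."""
    return wh * 3600.0 / 1e6


def break_even_minutes(budget_wh: float, power_w: float) -> float:
    """Minutes a device drawing power_w can run on budget_wh."""
    return budget_wh / power_w * 60.0


human_wh = energy_wh(HUMAN_POWER_W, EXAM_HOURS)
print(f"human budget: {human_wh:.0f} Wh = {wh_to_mj(human_wh):.2f} MJ")
print(f"one 600 W device breaks even after "
      f"{break_even_minutes(human_wh, GPU_POWER_W):.0f} minutes")
```

Under these assumptions a single 600 W device gets 90 minutes in total, which matches the "45+45 minutes" above; with N devices in parallel the allowance shrinks to 90/N minutes.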

Convert libraries, elephants, etc into SI of course! Otherwise, they aren't really comparable...
Kilocalories. A unit of energy that equals 4184 Joules.
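As a quick illustration of that unit, here is a minimal sketch converting the 900 Wh human estimate from upthread into kilocalories. The 900 Wh input is that commenter's rough guess, and the helper name is hypothetical.

```python
JOULES_PER_WH = 3600.0    # 1 watt-hour in joules
JOULES_PER_KCAL = 4184.0  # 1 kilocalorie in joules, as stated above


def wh_to_kcal(wh: float) -> float:
    """Convert watt-hours to kilocalories."""
    return wh * JOULES_PER_WH / JOULES_PER_KCAL


# 900 Wh is the rough two-day human estimate from upthread, not a measurement.
print(f"{wh_to_kcal(900.0):.0f} kcal")
```

That comes out to roughly 774 kcal.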
Human IMO contestants are also trained specifically on IMO problems.
They can train it on “Crux Mathematicorum” and similar journals, which are collections of “interesting” problems and their solutions.

https://cms.math.ca/publications/crux

Some unofficial comparison with costs of public models (performing worse): https://matharena.ai/imo/

So the real cost is probably much higher.

>it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.

Not sure that's exactly what that means. It's already likely the case that these models saw IMO problems and solutions during pretraining. It's possible this means they were present in the system prompt or something similar.

Does the IMO reuse problems? My understanding is that new problems are submitted each year and 6 are selected for each competition. The submitted problems are then published after the IMO has concluded. How would the training data contain unpublished, newly submitted problems?

Obviously the training data contained similar problems, because that's what every IMO participant already studies. It seems unlikely that they had access to the same problems though.

The IMO doesn't reuse problems, but Terence Tao has a Mastodon post where he explains that the first five (of six) problems are generally ones where existing techniques can be leveraged to get to the answer. The sixth problem requires considerable originality. Notably, neither Gemini nor OpenAI's model solved the sixth problem. Still quite an achievement though.
Do you have another source for that? I checked his Mastodon feed and don't see any mention of the source of the questions from the IMO.

https://mathstodon.xyz/@tao

Strange statement. It's not true in general, for sure: problems 3 and 6 are typically the hardest, but they certainly aren't fundamentally different in nature from the other questions. This year P6 did seem to be by far the hardest, though, so this post hoc statement should be read cautiously.
>How would the training data contain unpublished, newly submitted problems?

I don't think I or OP suggested it did.

Or that they did significant retraining to boost IMO performance, creating a more specialized model at the cost of general-purpose performance.