by bluecoconut 637 days ago
oh, they do talk about it

  On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
This shows that as they increase k, the number of samples in the ensemble, the score keeps climbing: from 74% with a single sample, to 83% with consensus among 64, all the way up to 93% when re-ranking 1000 samples.
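For anyone unfamiliar with the two ensembling strategies the quote mentions, here is a rough sketch of how I read them: "consensus" as a majority vote over sampled answers, and "re-ranking" as picking the sample that a scoring function rates highest. The function names and the toy scorer are my own placeholders; the quote doesn't say how the learned scoring function actually works.

```python
from collections import Counter
from typing import Callable, Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Majority vote: return the most frequent final answer among the samples."""
    return Counter(samples).most_common(1)[0][0]


def rerank_answer(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-n: return the sample that the scoring function rates highest."""
    return max(samples, key=score)


# Hypothetical final answers from k = 5 independent samples of one problem.
samples = ["204", "204", "113", "204", "097"]

# Stand-in for a learned scorer (assumption): here it just prefers "113".
toy_score = lambda answer: 1.0 if answer == "113" else 0.0

voted = consensus_answer(samples)
reranked = rerank_answer(samples, toy_score)
```

Either way, compute grows linearly with k, since every extra sample is another full generation. Note the two methods can disagree: the vote only needs the right answer to be the most common one, while re-ranking can recover a correct answer that shows up just once, provided the scorer recognizes it.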

I'd be curious to know whether the size of the ensemble is another scaling dimension for compute, alongside the "thinking time".