
Wow. That's an impressive result, though we definitely need some more details on how it was achieved.

What techniques were used? He references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results. If OpenAI ran this 10,000 times in parallel and cherry-picked the best one, then this is a lot less exciting.
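For a rough sense of why cherry-picking matters, here's a back-of-the-envelope sketch. It assumes each run is independent and succeeds with the same probability `p`, which is a simplification. The 0.1% per-run success rate is a number I made up for illustration, not anything reported about the actual model.

```python
def prob_at_least_one_success(p: float, n: int) -> float:
    """Chance that at least one of n independent runs succeeds.

    Assumes every run succeeds with the same probability p
    (a simplifying assumption; real runs are likely correlated).
    """
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    hypothetical_p = 0.001  # made-up per-run success rate, for illustration only
    for n in (1, 100, 10_000):
        chance = prob_at_least_one_success(hypothetical_p, n)
        print(f"{n:>6} runs: {chance:.4f}")
```

Under those assumptions, even a model that almost never gets it right in a single run would almost certainly produce at least one correct answer somewhere in 10,000 runs. That only counts if something outside the model picks the winner, which is why the selection method matters so much here.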

If this is legit, then I really want to know what tools were used and how the model used them.