by bcaine
1621 days ago
Not to pour too much cold water on this, but the claim of 100% accuracy has a huge caveat. In the paper (page 4) they state:

"Interaction. The original question may not be a prompt that synthesizes a program whose execution results in the correct answer. In addition, the answer may require multiple steps with clear plots or other modalities. We therefore may interactively prompt Codex until reaching the correct answer or visualizations, making the minimum necessary changes from the original question."

To me, that basically sounds like they had a human in the loop (one who knows how to solve these math problems) who kept changing the question until it gave the correct answer.

They do measure the distance (using a sentence embedding model) from the original question to the one that yielded the correct answer, but that feels a bit contrived to me.

Nevertheless, it's still really cool that the correct answer is indeed inside the model.
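For anyone wondering what "distance using a sentence embedding model" might look like in practice, here is a rough sketch. It assumes the comparison is a cosine similarity between two embedding vectors, which I have not checked against the paper. The vectors below are made-up toy values, not output from any real model, and the function and variable names are my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings. In a real setup a sentence embedding model
# would produce these from the original question and the edited prompt.
original_question = [0.2, 0.7, 0.1]
edited_prompt = [0.25, 0.65, 0.05]

similarity = cosine_similarity(original_question, edited_prompt)
print(f"similarity between original and edited prompt: {similarity:.3f}")
```

A score near 1 only says the edited prompt is close to the original in embedding space. It says nothing about how much problem-solving knowledge the human added while rewording it, which is why the metric feels contrived to me.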
|
This makes the "at scale" claim in the abstract clearly false IMO. Any AI system that requires that much human intervention is not scalable. When they have a second AI to produce the prompts automatically from the original questions, then they can claim to have achieved scalability.
But even without that, a system like this can still certainly be useful. And I expect rapid progress in the next few years.