by bcaine 1621 days ago
Not to pour too much cold water on this, but the claim of 100% accuracy has a huge caveat. In the paper (Page 4) they state:

Interaction. The original question may not be a prompt that synthesizes a program whose execution results in the correct answer. In addition, the answer may require multiple steps with clear plots or other modalities. We therefore may interactively prompt Codex until reaching the correct answer or visualizations, making the minimum necessary changes from the original question

Which to me basically sounds like they had a human in the loop (that knows how to solve these math problems) that kept changing the question until it gave the correct answer. They do measure the distance (using a sentence embedding model) of the original question to the one that yielded the correct answer, but that feels a bit contrived to me.
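To make the "distance between prompts" idea concrete, here is a rough sketch of a cosine-similarity comparison. The paper uses a sentence embedding model; the `toy_embedding` function below is only a hypothetical word-count stand-in so the example runs without any model, and the two strings are the original and rewritten prompts from one of the paper's examples.

```python
import math
from collections import Counter

def toy_embedding(text: str) -> Counter:
    """Stand-in for a sentence embedding: lowercase word counts.
    This is an assumption for illustration, not the paper's model."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

original = "Solve each equation for x. ln(x^2 - 1) = 3"
rewritten = "Using Sympy, solve Eq ln(x**2-1)=3 for x."
similarity = cosine_similarity(toy_embedding(original), toy_embedding(rewritten))
print(f"similarity = {similarity:.3f}")
```

A score near 1 would mean the rewrite stayed close to the original wording. It says nothing about how much problem-solving knowledge the human added in the rewrite, which is the part that feels contrived.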

Nevertheless, it's still really cool that the correct answer is indeed inside the model.


Proving Douglas Adams correct. The question is harder than the answer.

This makes the "at scale" claim in the abstract clearly false IMO. Any AI system that requires that much human intervention is not scalable. When they have a second AI to produce the prompts automatically from the original questions, then they can claim to have achieved scalability.

But even without that, a system like this can still certainly be useful. And I expect rapid progress in the next few years.

That said, the correct answer is also likely on the web. With a suitable search query you would find the paper, textbook, or wiki page with the right answer. A text highlighting model could also likely extract this answer from the text. The training probably achieves a good degree of memorization for these known results.

This raises the question: would we be impressed by a similar compression algorithm for storing past web documents?

The main achievement is not the compression, but the search functionality (search==solve).
Well, the trivial test to make sure it's not memorized would be to change constants in the input so that the correct answer changes but the problem gets no harder, assuming the model is actually doing the calculation.
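A minimal sketch of that perturbation test, using the ln(x^2 - 1) = 3 question from the paper as the template. The helper name `make_variant` and the choice of constants are assumptions for illustration; the reference answer comes from the closed form x = ±sqrt(e^c + 1), and the model's output for each prompt would be compared against it.

```python
import math

def make_variant(rhs: float) -> tuple[str, float]:
    """Return a perturbed prompt and the positive root of ln(x^2 - 1) = rhs.
    Hypothetical helper: only the right-hand constant changes."""
    prompt = f"Using Sympy, solve Eq ln(x**2-1)={rhs} for x."
    # ln(x^2 - 1) = rhs  =>  x^2 = e^rhs + 1  =>  x = +/- sqrt(e^rhs + 1)
    expected_positive_root = math.sqrt(math.exp(rhs) + 1)
    return prompt, expected_positive_root

variants = [make_variant(rhs) for rhs in (3, 4, 5)]
for prompt, root in variants:
    print(prompt, "-> x = +/-", round(root, 4))
```

A system that had merely memorized the textbook answer for the constant 3 would return the same roots for every variant, which is easy to spot because the reference roots all differ.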
But the correct answer isn't inside the model at all, in none of their examples. The correct answer is inside SymPy or NumPy, at least 99% of the time. That is, the model doesn't respond with a demonstration or with the answer itself: it responds with a Python program that poses the given question to SymPy or NumPy, and then they run that program and report the answer.

Here is a basic example:

MIT Course question: Solve each equation for x. ln(x^2 − 1) = 3

Model input: Using Sympy, solve Eq ln(x**2-1)=3 for x.

Model output:

  from sympy import *
  x = symbols('x')
  solve(log(x**2-1) - 3, x)
As you can see, the model has simply translated a mechanized form of the original question into equivalent Python code. The model has no idea how to solve an equation: it's using a symbolic equation solver.
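For anyone who wants to run it, here is a sketch of the same call with explicit imports, plus a substitution check that is my addition and not part of the model output. It assumes SymPy is installed.

```python
from sympy import Symbol, log, solve

x = Symbol("x")
# Same call as the model output above, with explicit imports.
solutions = solve(log(x**2 - 1) - 3, x)

# Substitute each root back into the left-hand side and evaluate numerically.
residuals = [abs(float(log(s**2 - 1) - 3)) for s in solutions]
print(solutions)
print(residuals)
```

All of the algebra (exponentiating both sides, isolating x^2, taking both square roots) happens inside `solve`; the generated program only states the equation.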

In other cases, they even "tidy" the original question to a representation of the solution. Here is their example E in Figure 2:

Original question:

> Outside of their humdrum duties as 6.042 TAs, Sayan is trying to learn to levitate using only intense concentration and Jelani is launching a “Nelson 2008” presidential campaign. Suppose that Sayan’s probability of levitating is 1/6, Jelani’s chance of becoming president is 1/4, and the success of one does not alter the other’s chances. If at most one of them succeeds, what is the probability that Jelani becomes the president of the United States?

"Tidied question":

> Suppose that Sayan’s probability Sayan succeeding is 1/6, and Jelani’s probability of succeeding is 1/4.

> Use numpy to find the probability at most one of Jelani and Sayan succeed.

> Use numpy to find the probability Jelani succeeds but Sayan does not succeed.

> Divide the former by the latter probability.

Model solution:

  import numpy as np
  # Probability of Sayan succeeding
  p_sayan = 1/6
  # Probability of Jelani succeeding
  p_jelani = 1/4
  # Probability of at most one of Jelani and Sayan succeeding
  p_at_most_one = 1 - (p_sayan * p_jelani)
  # Probability of Jelani succeeding but Sayan not succeeding
  p_jelani_succeed_sayan_not = p_jelani * (1 - p_sayan)
  # Probability of Jelani succeeding but Sayan not succeeding divided by
  # probability of at most one of Jelani and Sayan succeeding
  p_jelani_succeed_sayan_not / p_at_most_one
  # Probability of at least one of Jelani and Sayan succeeding
  p_at_least_one = 1 - p_at_most_one
Tidying up the extra verbiage of the question is absolutely fair. But they also explain exactly how to compute the result from the data in the question, and the model then generates code that perfectly matches the described algorithm. Again, it's not using even the tiniest bit of mathematical understanding.
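As a sanity check on the numbers, here is a small sketch that enumerates the four outcomes with exact fractions instead of following the prescribed steps. The function and variable names are mine, not the paper's; the only inputs are the two probabilities and the independence assumption stated in the question.

```python
from fractions import Fraction
from itertools import product

p_sayan = Fraction(1, 6)
p_jelani = Fraction(1, 4)

def outcome_probability(sayan_ok: bool, jelani_ok: bool) -> Fraction:
    """Joint probability of one outcome, using independence."""
    ps = p_sayan if sayan_ok else 1 - p_sayan
    pj = p_jelani if jelani_ok else 1 - p_jelani
    return ps * pj

outcomes = list(product([False, True], repeat=2))  # (sayan_ok, jelani_ok)

# Condition: at most one of them succeeds, i.e. not both.
p_at_most_one = sum(outcome_probability(s, j) for s, j in outcomes if not (s and j))
# Event of interest within that condition: Jelani succeeds.
p_jelani_and_at_most_one = sum(
    outcome_probability(s, j) for s, j in outcomes if j and not (s and j)
)

answer = p_jelani_and_at_most_one / p_at_most_one
print(answer)  # prints 5/23
```

This agrees with the ratio the generated code computes. Note that the last line of the quoted output, `1 - p_at_most_one`, evaluates to 1/24, which is the probability that both succeed and not the probability that at least one does.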

I have browsed their examples, and I have not seen even a single one where the model does more than rephrase the question into a 1:1 Python representation of the question itself.

None of the answers would pass even the simplest undergrad exam. They are literally of the form "how would you solve equation E?" "I would write a program that says sympy.solve(E)".

Well, they do say very clearly that they "solve" problems by program synthesis, and what they describe is perfectly legit program synthesis.

To clarify, program synthesis (or automatic programming) is the task of generating programs from specifications. There are two kinds of program synthesis: deductive program synthesis, from a complete specification of the target program; and inductive program synthesis, or program induction, from an incomplete specification (such as sets of program inputs and outputs, or traces). An example of deductive program synthesis is the generation of low-level code from a high-level language by a compiler.
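For contrast with the deductive case, here is a toy sketch of inductive synthesis by brute-force enumeration. The four-primitive DSL and the `synthesize` function are hypothetical and invented for this illustration; real systems search far larger program spaces with much smarter pruning.

```python
from itertools import product

# Hypothetical toy DSL: unary integer functions built by chaining primitives.
PRIMITIVES = {
    "x + 1": lambda x: x + 1,
    "x * 2": lambda x: x * 2,
    "x * x": lambda x: x * x,
    "x - 3": lambda x: x - 3,
}

def run(names, x):
    """Apply the named primitives to x from left to right."""
    for name in names:
        x = PRIMITIVES[name](x)
    return x

def synthesize(examples, max_depth=2):
    """Return the first chain of primitives consistent with every
    (input, output) example, or None if no chain up to max_depth fits."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(run(names, i) == o for i, o in examples):
                return names
    return None

# Incomplete specification: three input/output pairs only.
examples = [(1, 4), (2, 6), (5, 12)]
found = synthesize(examples)
print(found)  # prints ('x + 1', 'x * 2')
```

The examples never say "add one, then double"; the search has to discover that. In the paper's setup the prompt spells out each step, so nothing is left to discover.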

What the paper describes is a kind of deductive program synthesis from a complete specification in natural language. I suspect the true contribution of the work is the demonstration of using natural language as a complete specification, where earlier work generally only demonstrated the use of natural language as incomplete specification (for example, comments describing intent rather than implementation) and the combination of natural language with code, as in the original Codex work [Edit: actually, now that I look again, the Codex paper also has examples of comments that fully specify the target program, e.g. in Figure 2: https://arxiv.org/abs/2107.03374; so the work above is typically incremental].

On the other hand it's clear to me that the training has made the model memorise answers, and all the work in prompt engineering, described under "Workflow", serves to find the right prompts to retrieve the desired memorisations, much like one must fire just the right SQL query to get back the right data.

Certainly interesting to see in action and useful for everyday work, but far from "solving" anything in the grandiose way that it is announced by the authors (e.g. "These astounding results..." in section "Conclusion", etc).

How well would Copilot™ do on this type of problem?
I believe Copilot uses the same underlying research as the paper: Codex.
I was hoping their breakthrough was that they had found a general way to parse conceptual problems into the language of math and logic. That is the truly hard part, and what people spend a lot of time learning to do. Software like Octave and Mathematica can already evaluate tons of things once parsed.