by maxwells-daemon 1773 days ago
Look at the "math test" video.

Given the question: "Jane has 9 balloons. 6 are green and the rest are blue. How many balloons are blue?" The model outputs: "jane_balloons = 9; green_balloons = 6; blue_balloons = jane_balloons - green_balloons; print(blue_balloons)"

That seems like a good justification of a (very simple) step-by-step reasoning process!

3 comments

I wonder what it would have output if we removed the “ and the rest are blue” part from the question.

It would not surprise me if an inattentive human student answered that with the same code. After all, school “trains” people to expect such challenges to be solvable. A more attentive human might say “we can’t know” or provide an upper limit on the number of potential blue balloons.
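
To make the "upper limit" concrete, here is a minimal sketch, assuming only the numbers from the quoted question and assuming the leftover balloons could be any colour. The helper name `blue_balloon_bounds` is made up for illustration and is not anything the model produced.

```python
def blue_balloon_bounds(total, green):
    """Return (lowest, highest) possible number of blue balloons when the
    question no longer says what colour the remaining balloons are.

    Hypothetical helper for illustration only.
    """
    if green < 0 or green > total:
        raise ValueError("green must be between 0 and total")
    remaining = total - green
    # Anywhere from none to all of the remaining balloons could be blue.
    return 0, remaining


low, high = blue_balloon_bounds(total=9, green=6)
print(f"between {low} and {high} blue balloons")
```

So the honest answer to the truncated question is a range, not the single number the original code prints.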

Related article: Teaching GPT-3 to Identify Nonsense

https://arr.am/2020/07/25/gpt-3-uncertainty-prompts/

Chances are high that something similar was in the training set, and the model approximated it.
You are very likely right. The question is how far the approximation can generalise. One way to test that would be to quiz the model with slightly varied prompts. Any human who can “solve” this word problem should reasonably be expected to solve the same problem if we change the subject’s name (from Jane to Bob, or Sanj, or even to Xcfg), or the name of the object (from balloon to token, or even to embobler), or the attributes used to segment them (from green/blue to heavy/light, for example).
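
A rough sketch of how such a quiz could be generated, assuming we keep the numbers fixed and only swap the name, object, and attributes. `TEMPLATE` and `make_variants` are names invented here, and actually sending the prompts to the model is left out because that interface is not described in the thread.

```python
from itertools import product

# Same sentence structure as the original question, with the parts to vary
# turned into placeholders. Plurals are formed by naively appending "s".
TEMPLATE = (
    "{name} has {total} {obj}s. {known} are {attr_a} and the rest are "
    "{attr_b}. How many {obj}s are {attr_b}?"
)


def make_variants(names, objects, attribute_pairs, total=9, known=6):
    """Build (prompt, expected_answer) pairs for every combination."""
    variants = []
    for name, obj, (attr_a, attr_b) in product(names, objects, attribute_pairs):
        prompt = TEMPLATE.format(
            name=name, total=total, obj=obj,
            known=known, attr_a=attr_a, attr_b=attr_b,
        )
        variants.append((prompt, total - known))
    return variants


variants = make_variants(
    names=["Jane", "Bob", "Sanj", "Xcfg"],
    objects=["balloon", "token", "embobler"],
    attribute_pairs=[("green", "blue"), ("heavy", "light")],
)
for prompt, expected in variants[:3]:
    print(expected, "|", prompt)
```

Each prompt could then be fed to the model and the printed result of its generated code compared against the expected answer.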

Or we can try to rewrite the challenge sentences with different wording. As long as the new sentences convey the same problem, you would expect a system that can “understand” them to generate the same or a similar solution.

Curiously, this kind of thought experiment also shows a weakness of the Turing test as originally formulated. A machine correctly solving these word puzzle variations could “prove” that it “understands” the sentences, but it would also reveal that it is not a human, since I would expect a real human to protest against the inanity of the challenges quite fast. ;)

This goes for humans too. Ultimately, "something similar was in the training set" is semantically indistinguishable from "having a rich generalizable conceptual toolbox".
Except I could do that with a few regex substitutions, which would not be reasoning. The “intelligence” is in the templates provided by the training data. (Extracting that is impressive, but not that impressive.)
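
For what it's worth, here is roughly what I mean by regex substitutions: a minimal sketch, assuming the question always follows the exact sentence shape from the demo. `PATTERN`, `to_code`, and `solve_by_template` are names I made up, and this is not a claim about how the model works internally.

```python
import re

# One hard-coded template for "<name> has <n> <objects>. <m> are <a> and the
# rest are <b>. ..." Anything after the second sentence is ignored.
PATTERN = re.compile(
    r"(\w+) has (\d+) (\w+)\. (\d+) are (\w+) and the rest are (\w+)\..*"
)


def to_code(match):
    """Fill the captured words into a fixed code template."""
    name, total, objs, known, attr_a, attr_b = match.groups()
    name = name.lower()
    return (
        f"{name}_{objs} = {total}; "
        f"{attr_a}_{objs} = {known}; "
        f"{attr_b}_{objs} = {name}_{objs} - {attr_a}_{objs}; "
        f"print({attr_b}_{objs})"
    )


def solve_by_template(question):
    """Return generated code, or the question unchanged if it doesn't match."""
    return PATTERN.sub(to_code, question, count=1)


print(solve_by_template(
    "Jane has 9 balloons. 6 are green and the rest are blue. "
    "How many balloons are blue?"
))
```

Swapping the name, object, or attributes still works, but drop the “and the rest are blue” part or reword the sentence and the pattern simply fails to match, which is the kind of brittleness the varied-prompt test above is meant to probe.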