by tel
624 days ago
I tried to replicate this and Claude 3.5 Sonnet got it correct on the first try. It generated a second set of dates which contained no solution, so I asked it to write another Python program that generates valid date sets. Here's the code it generated: https://gist.github.com/tel/8e126563d2d5fb13e7d53cf3adad862e

In my testing, it has absolutely no trouble with this problem and can correctly translate the "theory of mind" into a progressive constraint solver.

Norvig is, of course, a well-respected researcher, but this is a bit disappointing. I feel confident he found that his tests failed, but to disprove his thesis (at least as is internally consistent with his experiment) we just need to find a single example of an LLM writing Python code that realizes the answer. I found that on the first try.

I think it's possible that there exists some implementation of this problem, or something close enough to it, already in Claude's training data. It's quite hard to disprove that assertion. But still, I am satisfied with the code and its translation. Relating the word problem to this solution requires contemplating each character's state of mind as a set of alternatives consistent with the information they've been given.
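To make "state of mind as a set of alternatives" concrete, here is a minimal sketch of a progressive constraint solver. It is not the code from the gist. It assumes the puzzle under discussion is the classic Cheryl's Birthday puzzle (Albert is told the month, Bernard the day, and they make three statements in turn), and the names `DATES`, `month_mates`, `day_mates`, and `solve` are made up for illustration.

```python
# Assumption: the classic Cheryl's Birthday date set.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def month_mates(date, candidates):
    """Dates Albert still considers possible if told this date's month."""
    return [d for d in candidates if d[0] == date[0]]


def day_mates(date, candidates):
    """Dates Bernard still considers possible if told this date's day."""
    return [d for d in candidates if d[1] == date[1]]


def solve(dates):
    """Return every date consistent with all three statements."""
    # Statement 1: Albert doesn't know, and he knows Bernard doesn't know either.
    after_1 = [
        d for d in dates
        if len(month_mates(d, dates)) > 1
        and all(len(day_mates(m, dates)) > 1 for m in month_mates(d, dates))
    ]
    # Statement 2: Bernard didn't know at first, but knows after statement 1.
    after_2 = [
        d for d in after_1
        if len(day_mates(d, dates)) > 1 and len(day_mates(d, after_1)) == 1
    ]
    # Statement 3: Albert now knows too.
    return [d for d in after_2 if len(month_mates(d, after_2)) == 1]


print(solve(DATES))
```

Each statement just shrinks the candidate set, so a date set with no valid solution shows up as an empty (or multi-element) result rather than a crash.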
|
That's good, but no cigar, and it certainly didn't get it "correct on the first try". First it generated a partially correct solution. Then you had to prompt it again to generate a new program. You were only able to do that because you already knew what the right answer looks like. The second program is missing a second set of dates, so it's not clear whether it really got it right the second time, or whether it just reproduced a different program from its training set because you asked for one, without understanding the problem or what was wrong with the first program.
>> I feel confident he found that his tests failed, but to disprove his thesis (at least as is internally consistent with his experiment) we just need to find a single example of an LLM writing Python code that realizes the answer. I found that on the first try.
That's not how testing LLM code generation is done in practice, precisely because of the variance that can be expected in generated results. To properly test an LLM (and I would agree that Norvig's experiment falls a little short of this) one has to run multiple experiments and evaluate all the results in aggregate in some form. The usual way to do it is to draw k samples from the LLM's distribution and check whether at least one of them is correct (the pass@k metric). That's an awful metric because it basically allows arbitrary "guesses" until the LLM code generator gets it right. A simpler test is to generate k programs, check whether each program is right or wrong, assign 1 for each correct answer and 0 for each incorrect answer, then average over all answers. It's an open question whether to count a partial answer as 0 or 0.5.
So if we take the total failure in Norvig's experiment and the merely partial success in yours, and allow for the most charitable aggregation of results, we get something like 0.25 accuracy, which is not that impressive, not least because it's evaluated on just two test samples.
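For what it's worth, here is a rough sketch of the two scoring schemes discussed above, applied to these two samples. The outcome labels and the function names `mean_score` and `solved_at_least_once` are made up for illustration, not taken from any evaluation library.

```python
from statistics import mean


def mean_score(outcomes, partial_credit=0.5):
    """Average per-sample score: 1 for correct, 0 for wrong,
    and `partial_credit` (the open question: 0 or 0.5) for partial."""
    points = {"correct": 1.0, "partial": partial_credit, "wrong": 0.0}
    return mean(points[o] for o in outcomes)


def solved_at_least_once(outcomes):
    """The lenient "any of k samples is correct" criterion."""
    return any(o == "correct" for o in outcomes)


# Norvig's run counted as a failure, the Claude 3.5 Sonnet run as partial.
samples = ["wrong", "partial"]
print(mean_score(samples))                      # most charitable: partial = 0.5
print(mean_score(samples, partial_credit=0.0))  # strict: partial = 0
print(solved_at_least_once(samples))
```

With only two samples either number is very noisy, which is the point about sample size.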
Also, please don't underestimate the knowledge of experts like Peter Norvig.