Sometimes! My attempt with GPT-4 yields a response where it acknowledges the print/len swap, but does not produce correct code in the end - it sort of loses track of what the original goal was. https://chat.openai.com/share/300382cb-ac72-4a75-847c-ecbf5a...
And if we're doing Science, i.e., trying to explain how ChatGPT works and what its intrinsic properties are, this case is far more significant than the other.
Inasmuch as the hypothesis that ChatGPT works "so as to be actually sensitive to the meaning of the code" is here falsified -- by a single case.
An infinite number of apparent confirmations of this hypothesis are now Invalid!
I'm not comfortable with this introduction of falsificationism to what is not a scientific experiment, but only an experiment testing the predictive accuracy of a classifier. Of course the classifier will get it wrong sometimes, because it's only approximating a function: that's by definition, and even by design, i.e. we build classifiers as function approximators because we know that learning precise definitions of target concepts is really hard. Under PAC-learning assumptions, we expect a classifier to have some probability of some error, and we are only trying to estimate the probability of a certain degree of error in the classifier's decision.
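To put a rough number on "some probability of some error", here is a minimal sketch. It assumes the test examples are drawn i.i.d. from the distribution you care about, and it uses Hoeffding's inequality, which is my choice of bound for illustration and not something specific to any LLM evaluation. The function names `hoeffding_margin` and `error_interval` are made up for this example.

```python
import math


def hoeffding_margin(n_samples: int, delta: float) -> float:
    """Half-width of a two-sided Hoeffding interval.

    With probability at least 1 - delta, the observed error rate on
    n_samples i.i.d. examples is within this margin of the true error.
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))


def error_interval(n_errors: int, n_samples: int, delta: float = 0.05):
    """Interval for the true error rate, clipped to [0, 1]."""
    observed = n_errors / n_samples
    margin = hoeffding_margin(n_samples, delta)
    return max(0.0, observed - margin), min(1.0, observed + margin)


# One observed failure (or one success) pins down nothing:
single_case = error_interval(n_errors=1, n_samples=1)

# A thousand independent trials narrows things considerably:
many_cases = error_interval(n_errors=100, n_samples=1000)
```

Under these assumptions a single failing case leaves the interval at the whole of [0, 1], which is the point: one observation estimates an error rate very poorly, it does not falsify anything about a probabilistic classifier.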
The practical problem of course is that, in good practice, we estimate the error of a classifier by testing it on (ostensibly) unseen data, i.e. data that was not available to the classifier during training. With LLMs that kind of testing is impossible because nobody knows what's in their training data and so nobody can safely assume that success, or failure, on a specific task, is predictive of the performance of the model on an arbitrarily chosen task.
To make matters worse, everybody should understand very well by now that LLMs' performance varies, even wildly varies, with their prompt, and there is no known way to systematically create prompts that maximise the probability of a desired response. The result is that every observation of an LLM failing to carry out a task may be just that, or it may be an observation of the user failing to prompt the LLM so as to maximise the probability of the correct response.
In a sense, testing LLMs by hand-crafted prompts risks measuring the experimenter's ability to craft a prompt, rather than the LLM's ability to respond correctly. In that sense, we can't really falsify any hypothesis about LLMs' capabilities.
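One way to at least see how much of a result is due to the prompt is to run the same task under several phrasings and compare success rates. The sketch below is only an illustration: `model` is any callable from prompt to response, `is_correct` is a checker you supply, and `toy_model` is a stand-in I made up so the example runs without a real LLM. None of these names come from an actual API.

```python
from typing import Callable, List, Sequence


def success_rates(
    model: Callable[[str], str],
    prompts: Sequence[str],
    is_correct: Callable[[str], bool],
    n_samples: int = 20,
) -> List[float]:
    """Fraction of correct responses for each prompt variant."""
    rates = []
    for prompt in prompts:
        # Sample repeatedly, since the same prompt can give different outputs.
        hits = sum(1 for _ in range(n_samples) if is_correct(model(prompt)))
        rates.append(hits / n_samples)
    return rates


def prompt_spread(rates: Sequence[float]) -> float:
    """Gap between the best and worst prompt variant."""
    return max(rates) - min(rates)


# Hypothetical stand-in for an LLM: it only "succeeds" when the prompt
# mentions that the builtins were swapped. A real model would be stochastic.
def toy_model(prompt: str) -> str:
    return "correct" if "swapped" in prompt else "wrong"


variants = [
    "Complete this function.",
    "Note that print and len are swapped. Complete this function.",
]
rates = success_rates(toy_model, variants, lambda r: r == "correct")
spread = prompt_spread(rates)
```

A large spread means the measurement says as much about the wording as about the model, which is the worry raised above. It does not solve the problem of picking the variants in the first place.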
Of course, the flip side of that is that people should refrain from making any such hypotheses and instead work on the best method to systematically and rigorously test LLMs. Too bad very few people are willing to do that. Too bad for most, that is. I'm pretty sure that at some point someone will come up with a way to rigorously test LLMs and take the cookie, and leave everyone else feeling like fools for wasting all that time poking LLMs for nothing.
It's not black and white with these probabilistic models. The same input generated two outputs. Both were "actually sensitive to the meaning of the code", to varying degrees. One got it exactly right, one made an error, but partly got it right.