by mjburgess
1123 days ago
And if we're doing science, i.e., trying to explain how ChatGPT works and what its intrinsic properties are, this case is far more significant than the other. The hypothesis that ChatGPT works "so as to be actually sensitive to the meaning of the code" is here falsified, and by a single case. An infinite number of apparent confirmations of this hypothesis are now invalid!
|
The practical problem, of course, is that in good practice we estimate the error of a classifier by testing it on (ostensibly) unseen data, i.e. data that was not available to the classifier during training. With LLMs that kind of testing is impossible, because nobody knows what's in their training data, and so nobody can safely assume that success or failure on a specific task is predictive of the model's performance on an arbitrarily chosen task.
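To make the held-out-data point concrete, here is a minimal sketch of why contamination matters. Everything in it is a made-up toy: `train_memorizer` is a hypothetical lookup-table "classifier" that simply stores its training pairs, and the parity labels are only there to keep the numbers deterministic. It is not a claim about how any LLM works, just an illustration of how an error estimate collapses to something meaningless when the test items were seen during training.

```python
def label(x):
    """Toy ground truth: an integer is 'positive' when it is odd."""
    return x % 2 == 1

def train_memorizer(train):
    """Hypothetical classifier that memorises its training pairs.

    For inputs it has never seen it falls back to guessing False.
    """
    table = dict(train)

    def predict(x):
        return table.get(x, False)

    return predict

def error_rate(predict, data):
    """Fraction of (input, label) pairs the classifier gets wrong."""
    return sum(predict(x) != y for x, y in data) / len(data)

train = [(x, label(x)) for x in range(100)]
seen_test = train[:50]                                   # leaked into training
unseen_test = [(x, label(x)) for x in range(100, 200)]   # genuinely held out

model = train_memorizer(train)
contaminated_error = error_rate(model, seen_test)   # 0.0
clean_error = error_rate(model, unseen_test)        # 0.5

print(f"error on leaked test set:   {contaminated_error}")
print(f"error on held-out test set: {clean_error}")
```

The memoriser looks perfect on the leaked items and is at chance on the held-out ones. If you cannot tell which of the two test sets you are holding, the number you measure tells you very little.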
To make matters worse, everybody should understand very well by now that LLMs' performance varies, even wildly, with the prompt, and there is no known way to systematically create prompts that maximise the probability of a desired response. The result is that every observation of an LLM failing to carry out a task may be just that, or it may be an observation of the user failing to prompt the LLM so as to maximise the probability of the correct response.
In a sense, testing LLMs with hand-crafted prompts risks measuring the experimenter's ability to craft a prompt, rather than the LLM's ability to respond correctly. To that extent, we can't really falsify any hypothesis about LLMs' capabilities.
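One rough way to see the prompt confound is to run the same tasks under several phrasings and look at the spread instead of a single score. The sketch below assumes a hypothetical `toy_model` stand-in (a stub that only "understands" addition when the prompt contains "sum" or "+"); the templates and tasks are invented for illustration, and a real evaluation would call an actual model instead.

```python
import re

def toy_model(prompt):
    """Hypothetical stand-in for an LLM, deliberately prompt-sensitive.

    It adds the numbers in the prompt only if the prompt mentions
    'sum' or '+'; otherwise it just echoes the first number.
    """
    nums = [int(n) for n in re.findall(r"\d+", prompt)]
    if "sum" in prompt.lower() or "+" in prompt:
        return str(sum(nums))
    return str(nums[0])

def accuracy_by_template(model, templates, tasks):
    """Accuracy of `model` on the same tasks under each prompt template."""
    results = {}
    for template in templates:
        correct = sum(
            model(template.format(a=a, b=b)) == str(a + b) for a, b in tasks
        )
        results[template] = correct / len(tasks)
    return results

templates = [
    "What is the sum of {a} and {b}?",
    "Compute {a} + {b}.",
    "Add {a} to {b}.",
]
tasks = [(2, 3), (10, 5), (7, 8)]

scores = accuracy_by_template(toy_model, templates, tasks)
for template, acc in scores.items():
    print(f"{acc:.2f}  {template}")
print("spread:", max(scores.values()) - min(scores.values()))
```

An experimenter who only tried the third phrasing would conclude the model can't add, and one who only tried the first would conclude it can. Reporting the spread doesn't solve the problem, since nothing guarantees the chosen templates cover the prompts that matter, but it at least makes the confound visible.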
Of course, the flip side of that is that people should refrain from making any such hypotheses and instead work on the best method to systematically and rigorously test LLMs. Too bad very few people are willing to do that. Too bad for most, that is. I'm pretty sure that at some point someone will come up with a way to rigorously test LLMs and take the cookie, and leave everyone else feeling like fools for wasting all that time poking LLMs for nothing.