|
>I had no intention of implying any malfeasance in my use of the word "odd"; I mean it in the sense of unusual, unexpected and surprising. The thing is, you finished your precursor post saying, about your tests and mine, that it comes down to there being a human in the loop making a judgement call, but in a follow-on you say that there are thousands of quantitative metrics. Why, I wondered, would that matter, if it comes down to a human making a judgement call? Were you switching to a different line of argument, one that (as far as I could tell) had not been raised before? That's what I found surprising about your claim.
It matters because of humans. If I gave an LLM thousands of quantitative tests and it passed them all, but in an hour-long conversation a human could identify it as an LLM through some flaw, the human would consider all those tests useless. That's why it matters. The human making a judgement call is still a quantitative measurement btw, as you can limit human output to True or False. But because every human is different, you have to take measurements with multitudes of humans in order to get good numbers.
>I am still rather confused about how this fits into what you are saying more generally. At first I thought you were saying, in your latest post, that the Turing-test interrogator should be restricted to asking questions from the sets having quantitative metrics in order for it to be an objective process, but that doesn't really hold up, as far as I can see.
It can still be objective with a human in the loop, assuming the human is honest. What's not objective is a human offering an opinion in the form of a paragraph with no definitive clarity on what constitutes a metric. I realize that elements of MY metric have indeterminism to them, but it is still a hard metric because the output is over a well-defined set. Whenever you have indeterminism, you turn to probability and many samples in order to produce a final quantitative result.
>If so, then (no surprise) I think there are some problems with it, but before I go further, I would like to check that I understand your position.
yes my position is that exactly. If all observable qualities indicate it's a duck, then there's nothing more you can determine beyond that, scientifically speaking. You're implying there is a better way? |
> It matters because of humans...
I'm still a bit puzzled here, because the paragraph continuing from this point seems to argue that LLM performance on these tests doesn't matter as far as the question is concerned. You seem to be saying (paraphrased) that despite LLMs' impressive performance on these quantitative tests, they could still fail Turing tests, so that performance is not decisive.
> yes my position is that exactly…
The impression I get from what you have written in this post is that you are not claiming that a test conforming to your requirements has actually been successfully performed; you are just assuming it could be. Is that right?
Regardless, let’s assume (at least for the sake of argument) that the series of tests you propose has been performed, and the results are in: in the test environment, humans can’t distinguish current LLMs from humans any better than by chance. How do you get from that to answering the question we are actually interested in? The experiment does not explicitly address it. You might want to say something like “The Turing test has shown that the machines are as intelligent as humans, so, like humans, these machines must realize that the language they receive is about an external world”, but even the antecedent of that sentence is an interpretation that goes beyond what would have objectively been demonstrated by the Turing test, and the consequent is a subjective opinion that would not be entailed by the antecedent even if it were unassailable. Do you have a way to go from a successful Turing test to answering the question here, one that meets your own quantitative and objective standards?