| HN Mirror

by mannykannot 850 days ago

This is rather self-contradictory: you insist we can't make progress with wishy-washy conjectures on vague and fuzzy concepts, and yet your entire argument in this thread for your claim that machine understanding of the real world has been achieved is based on exactly that: your personal subjective assessment of LLM performance!

In your final paragraph, you attempt to suggest that my proposed test is no better than the Turing test (and therefore no better than what you are doing), but as you have not addressed the ways in which my proposal differs from the Turing test, I regard this as merely waffling on the issue. In practice, it is not so easy to come up with tests for whether a human understands an issue (as opposed to having merely committed a bunch of related propositions to memory) and I am trying to capture the ways in which we can make that call.

You entered this debate saying "I think we are way past the point of debate here. LLMs are not stochastic parrots. LLMs do understand an aspect of reality", yet your post here ends with "in the end there's a human in the loop making a judgment call", explicitly acknowledging that your strong initial claims are matters of opinion, rather than established facts supported by hard metrics.


ninetyninenine 850 days ago

>This is rather self-contradictory: you insist we can't make progress with wishy-washy conjectures on vague and fuzzy concepts, and yet your entire argument in this thread for your claim that machine understanding of the real world has been achieved is based on exactly that: your personal subjective assessment of LLM performance!

No it's not. I based my argument on a concrete metric. Human behavior. Human input and output.

> I regard this as merely waffling on the issue.

No offense intended but I disagree. There is a difference but that difference is trivial to me. To LLMs talking is also unpredictable. LLMs aren't machines directed to specifically generate creative ideas; they only do so when prompted. Leaving one to its own devices to generate random text does not necessarily lead to new ideas. You need to funnel it in the right direction.

>You entered this debate saying "I think we are way past the point of debate here. LLMs are not stochastic parrots. LLMs do understand an aspect of reality", yet your post here ends with "in the end there's a human in the loop making a judgment call", explicitly acknowledging that your strong initial claims are matters of opinion, rather than established facts supported by hard metrics.

There are thousands of quantitative metrics. LLMs perform especially well on these. Do I refer to one specifically? No. I refer to them all collectively.

I also think you misunderstood. Your idea is about judging whether an idea is creative or not. That's too wishy-washy. My idea is to compare the output to human output and see if there is a recognizable difference. The second idea can easily be put into an experimental quantitative metric in the exact same way the Turing test does it. In fact, like you said, it's basically just a Turing test.

Overall AI has passed the Turing test but people are unsatisfied. Basically they just need to make a harsher Turing test to be convinced. For example, tell people directly that the thing inside the computer may be an LLM and not a person, and have them investigate to uncover its true identity. If the LLM can successfully deceive the human consistently then that is literally the final bar for me.


mannykannot 850 days ago

What are these "thousands of quantitative metrics" on which you base your latest claims? If you have had them on hand all this while, it seems odd that you have not made use of them so far.


ninetyninenine 849 days ago

>What are these "thousands of quantitative metrics" on which you base your latest claims? If you have had them on hand all this while, it seems odd that you have not made use of them so far.

Hey, no offense, but I don't appreciate this style of commenting where you say it's "odd." I'm not trying to hide evidence from you and I'm not intentionally lying or making things up in order to win an argument here. I thought of this as an amicable debate. Next time, if you just ask for the metric rather than say it's "odd" that I don't present it, that would be more appreciated.

I didn't present evidence because I thought it was obvious. How are LLMs compared with one another in terms of performance? Usually that is done with quantitative tests. You can feed an LLM any number of these tests, including stuff like the SAT, the bar exam, the ACT, IQ tests, the SAT II, etc.

There are LLM-targeted tests as well:

https://assets-global.website-files.com/640f56f76d313bbe3963...

Most of these tests aren't enough though, as the LLM is remarkably close to human behavior and can do comparably well and even better than most humans. That last statement would usually make you think that those tests are enough, but they aren't, because humans can still detect whether or not the thing is an LLM with a longer targeted conversation.

The final run is really giving the human with full knowledge of his task a full hour of investigating an LLM to decide whether it's human or a robot. If the LLM can deceive the human that is a hard True/False quantitative metric. That's really the only type of quantitative test left where there is a detectable difference.


mannykannot 849 days ago

I had no intention of implying any malfeasance in my use of the word "odd"; I mean it in the sense of unusual, unexpected and surprising. The thing is, you finished your precursor post saying, about your tests and mine, that it comes down to there being a human in the loop making a judgement call, but in a follow-on you say that there are thousands of quantitative metrics. Why, I wondered, would that matter, if it comes down to a human making a judgement call? Were you switching to a different line of argument, one that (as far as I could tell) had not been raised before? That's what I found surprising about your claim.

I am still rather confused about how this fits into what you are saying more generally. At first I thought you were saying, in your latest post, that the Turing-test interrogator should be restricted to asking questions from the sets having quantitative metrics in order for it to be an objective process, but that doesn't really hold up, as far as I can see. Frankly, I suspect that the tests with objective metrics are beside the point, and the essence of your position is contained within your final paragraph: "If the LLM can deceive the human [then] that is a hard True/False quantitative metric [and the only sort we can get]."

If so, then (no surprise) I think there are some problems with it, but before I go further, I would like to check that I understand your position.


ninetyninenine 849 days ago

>I had no intention of implying any malfeasance in my use of the word "odd"; I mean it in the sense of unusual, unexpected and surprising. The thing is, you finished your precursor post saying, about your tests and mine, that it comes down to there being a human in the loop making a judgement call, but in a follow-on you say that there are thousands of quantitative metrics. Why, I wondered, would that matter, if it comes down to a human making a judgement call? Were you switching to a different line of argument, one that (as far as I could tell) had not been raised before? That's what I found surprising about your claim.

It matters because of humans. If I gave an LLM thousands of quantitative tests and it passed them all, but in an hour-long conversation a human could identify it as an LLM through some flaw, the human would consider all those tests useless. That's why it matters. The human making a judgement call is still a quantitative measurement, btw, as you can limit human output to True or False. But because every human is different, in order to get good numbers you have to do measurements with multitudes of humans.

>I am still rather confused about how this fits into what you are saying more generally. At first I thought you were saying, in your latest post, that the Turing-test interrogator should be restricted to asking questions from the sets having quantitative metrics in order for it to be an objective process, but that doesn't really hold up, as far as I can see.

It can still be objective with a human in the loop, assuming the human is honest. What's not objective is a human offering an opinion in the form of a paragraph with no definitive clarity on what constitutes a metric. I realize that elements of MY metric have indeterminism to them, but it is still a hard metric because the output is over a well-defined set. Whenever you have indeterminism you would then turn to probability and many samples in order to produce a final quantitative result.
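
To make the "probability and many samples" step concrete, here is a minimal sketch of how a pile of True/False verdicts could be reduced to one number. It assumes each judge session yields a single boolean (True if the judge correctly identified the LLM) and that blind guessing is right half the time; the function name and the sample verdicts are hypothetical, not data from any real experiment. It uses a one-sided exact binomial test, which is only one of several reasonable choices.

```python
from math import comb


def p_value_better_than_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial test.

    Returns the probability of seeing at least `correct` correct
    identifications in `trials` sessions if every judge were merely
    guessing with success probability `chance`.
    """
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical verdicts: True means the judge spotted the LLM.
    verdicts = [True] * 56 + [False] * 44
    p = p_value_better_than_chance(sum(verdicts), len(verdicts))
    print(f"{sum(verdicts)}/{len(verdicts)} correct, p-value vs. guessing: {p:.3f}")
```

A small p-value would suggest the judges can tell the difference; a large one means the verdicts are consistent with guessing, which is the "can't distinguish better than chance" outcome.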

>If so, then (no surprise) I think there are some problems with it, but before I go further, I would like to check that I understand your position.

yes my position is that exactly. If all observable qualities indicate it's a duck, then there's nothing more you can determine beyond that, scientifically speaking. You're implying there is a better way?


mannykannot 848 days ago

At this point, I think it is worth refreshing what the issue here is, which is whether LLMs understand that the language they receive is about an external world, which operates through causes which have nothing to do with token-combination statistics of the language itself.

> It matters because of humans...

I'm still a bit puzzled here, because it seems to me that the paragraph continuing from here is making the argument that LLM performance on these tests doesn't matter, as far as the question is concerned: in this paragraph you seem to be saying (paraphrased) that despite LLMs' impressive performance on these quantitative tests, they could still fail Turing tests, so their performance on these quantitative tests is not decisive.

> yes my position is that exactly…

The impression I get from what you have written in this post is that you are not claiming that a test conforming to your requirements has actually been successfully performed, you are just assuming it could be?

Regardless, let’s assume (at least for the sake of argument) that the series of tests you propose have been performed, and the results are in: in the test environment, humans can’t distinguish current LLMs from humans any better than by chance. How do you get from that to answering the question we are actually interested in? The experiment does not explicitly address it. You might want to say something like “The Turing test has shown that the machines are as intelligent as humans so, like humans, these machines must realize that the language they receive is about an external world” but even the antecedent of that sentence is an interpretation that goes beyond what would have objectively been demonstrated by the Turing test, and the consequent is a subjective opinion that would not be entailed by the antecedent even if it were unassailable. Do you have a way to go from a successful Turing test to answering the question here, which meets your own quantitative and objective standards?
