| >What are these "thousands of quantitative metrics" on which you base your latest claims? If you have had them on hand all this while, it seems odd that you have not made use of them so far. Hey no offense but I don't appreciate this style of commenting where you say it's "odd." I'm not trying to hide evidence from you and I'm not intentionally lying or making things up in order to win an argument here. I thought of this as a amicable debate. Next time if you just ask for the metric rather then say it's "odd" that I don't present it that would be more appreciated. I didn't present evidence because I thought it was obvious. How are LLMs compared with one another in terms of performance? Usually those are done with quantitative tests. You can feed any number of these tests including stuff like the SAT, BAR, ACT, IQ, SATII etc. They also have LLM targetted tests as well: https://assets-global.website-files.com/640f56f76d313bbe3963... Most of these tests aren't enough though as the LLM is remarkably close to human behavior and can do comparably well and even better than most humans. I mean that last statement I made would usually make you think that those tests are enough, but they aren't because humans can still detect whether or not the thing is an LLM with a longer targetted conversation. The final run is really giving the human with full knowledge of his task a full hour of investigating an LLM to decide whether it's human or a robot. If the LLM can deceive the human that is a hard True/False quantitative metric. That's really the only type of quantitative test left where there is a detectable difference. |
I am still rather confused about how this fits into what you are saying more generally. At first I thought you were saying, in your latest post, that the Turing-test interrogator should be restricted to asking questions from the sets having quantitative metrics in order for it to be an objective process, but that doesn't really hold up, as far as I can see. Frankly, I suspect that the tests with objective metrics are beside the point, and the essence of your position is contained within your final paragraph: "If the LLM can deceive the human [then] that is a hard True/False quantitative metric [and the only sort we can get]."
If so, then (no surprise) I think there are some problems with it, but before I go further, I would like to check that I understand your position.