by voidhorse 360 days ago
I don't think that's a fair representation of the argument.

The argument is not "here's one failure case, therefore they don't reason". The argument is that, systematically, if you give an LLM problem instances outside its training set in domains with clear structural rules, it will fail to solve them. The argument then goes that they must not have an actual model or understanding of the rules, as they seem to only be capable of solving problems in the training set. That is, they have failed to figure out how to solve novel problem instances of general problem structures using logical reasoning.

Their strict dependence on having seen the exact or extremely similar concrete instances suggests that they don't actually generalize—they just compute a probability based on known instances—which everyone knew already. The problem is we just have a lot of people claiming they are capable of more than this because they want to make a quick buck in an insane market.


That still seems unfalsifiable. If it fails one instance, the claim is that the failure is representative of things outside the training set. If it succeeds, the claim is that it is in the training set. Without a definitive way to say something is not in the training set (a likely impossible task), the measure of success or failure is the only indicator of the purported reason for the success or failure.

Given models can get things wrong even when the training data contains the answer, failure cannot show absence.

I do think there are cases where, in controlled environments, there is some degree of knowledge as to what is in the training set. I also don't think it's as impossible as you assume.

If you really wanted to ensure this with certainty, just use the natural numbers to parameterize an aspect of a general problem. Assume there are N foo problems in the training set; then there is always an (N+1)th parameter value not in the training set, and you can use this as an indicative case. Go ahead and generate an insane number of these and eventually the probability that the Mth instance is not in the set is effectively 1.

Edit: Of course, it would not be perfect certainty, but it is probabilistically effectively certain. The number of problem instances in the set is necessarily finite, so if you go large enough you get what you need. Sure, you wouldn't be able to say that a specific problem instance is not in the set, but the aggregate results would show whether or not the LLM deals with all cases or (on assumption) just known ones.
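
A minimal sketch of the sampling idea above, using many-digit addition as the parameterized problem. Everything here is an assumption for illustration: `make_instance`, `accuracy`, and `exact_solver` are hypothetical names, and `exact_solver` is only a stand-in for whatever call would actually query a model. The point is that only the aggregate score over many fresh instances is measured; no single instance is claimed to be absent from the training set.

```python
import random

def make_instance(digits, rng):
    """Build one addition problem whose operands each have exactly `digits` digits."""
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return f"{a} + {b}", a + b

def accuracy(solve, digits, trials, seed=0):
    """Fraction of freshly generated instances that `solve` answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        prompt, expected = make_instance(digits, rng)
        if solve(prompt) == expected:
            correct += 1
    return correct / trials

def exact_solver(prompt):
    """Hypothetical stand-in for a model call: parses the prompt and adds exactly."""
    left, right = prompt.split(" + ")
    return int(left) + int(right)

if __name__ == "__main__":
    # Replace exact_solver with a function that queries the model under test.
    print(accuracy(exact_solver, digits=40, trials=1000))
```

Increasing `digits` grows the instance space exponentially, so a large random sample becomes overwhelmingly unlikely to overlap with any finite set of memorized examples.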

Well there are models that can sum two many-digit numbers. They certainly have not been trained on every pair of integers up to that level. That either makes the claim they can't do things that they haven't seen trivially false, or the criteria for counting something as being in the training data includes a degree of inference.
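
The counting behind that claim can be sketched as follows. The function names are hypothetical, and the corpus size in the usage line is a made-up figure chosen only to show the scale, not a number for any real model.

```python
def ordered_pairs(digits):
    """Number of ordered pairs (a, b) where a and b each have exactly `digits` digits."""
    count = 9 * 10 ** (digits - 1)  # how many integers have exactly `digits` digits
    return count * count

def max_fraction_seen(digits, corpus_examples):
    """Upper bound on the fraction of pairs a corpus of `corpus_examples` sums could cover."""
    return min(1.0, corpus_examples / ordered_pairs(digits))

if __name__ == "__main__":
    # Assumed corpus size: 10**13 addition examples, a deliberately generous guess.
    print(max_fraction_seen(20, 10 ** 13))
```

Even under that generous assumption, the covered fraction of 20-digit pairs is vanishingly small, so getting such sums right cannot come from having seen each pair verbatim.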

What happens when someone claims they have gotten a model to do something not in the training data, and another person claims it must be encoded in the training data in some form? It seems like an impasse.

The lack of rigor and evidence behind the argument is the problem.
It is the side arguing that it is reasoning that is lacking rigor and evidence. The side arguing that it isn't is saying you need more rigor and evidence when you claim it is reasoning, and makes that point by showing simple cases where it fails.