> this is a preprint that has not been peer reviewed.
This conversation is peer review...
You don't need a conference for something to be peer reviewed, you only need... peers...
In fact, this paper is getting more peer review than most works. Conferences are notoriously noisy as reviewers often don't care and are happy to point out criticisms. All works have valid criticisms... Finding criticisms is the easy part. The hard part is figuring out if these invalidate the claims or not.
Honest question: does the opinion of Gary Marcus still count? His criticism seems more philosophical than scientific. It's hard for me to see what he builds, or how he reasons, to get to his conclusions.
I think this is a fair assessment, but reasoning and intelligence don't really have an established control or control group. If you build a test and say "It's not intelligent because it can't..." and someone goes out and adds that feature, is it suddenly intelligent?
If we make a physics breakthrough tomorrow, is there any LLM that is going to retain that knowledge permanently as part of its core, or will they all need to be retrained? Can we make a model that is as smart as a 5th grader without shoving the whole corpus of human knowledge into it, folding it over twice and then training it back out?
The current crop of tech doesn't get us to AGI. And the focus on making it "better" is for the most part a fool's errand. The real winners in this race are going to be those who hold the keys to optimization: short retraining times, smaller models (with less upfront data), optimized for lower-performance systems.
I actually agree with this. Time and again, I can see that LLMs do not really understand my questions, let alone perform logical deductions beyond in-distribution answers. What I’m really wondering is whether Marcus’s way of criticizing LLMs is valid.
I don't know, but the standard reply to all of Gary Marcus' criticisms is that they don't count because it's Gary Marcus, which of course is a big honking ad hominem.
What gets me, and the author talks about it in the post, is that people will readily attribute correct answers to "it's in the training set", but nobody says anything about incorrect answers that are in the training set. LLMs get stuff in the training set wrong all the time, but nobody takes that as evidence that a model probably can't be leaning too hard on its memorization for the complex questions it does get right.
It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.
> It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.
Both of those can be true at the same time, though. They memorize a lot of things, but it's fuzzy, and when they remember wrong they cannot fix it via reasoning.
It's more than fuzzy: they are packing exabytes, perhaps zettabytes, of training data into a few terabytes. Without any reasoning ability it must be divine intervention that they ever get anything right...
It is divine intervention if you believe human minds are the product of a divine creator. Most of the attribution of miraculous reasoning ability on the part of LLMs I would attribute to pareidolia on the part of their human evaluators. I don’t think we’re much closer at all to having an AI which can replace an average minimum wage full-time worker, who will work largely unsupervised but ask their manager for help when needed, without screwing anything up.
We have LLMs that can produce copious text but cannot stop themselves from attempting to solve a problem they have no idea how to solve and making a mess of things as a result. This puts an LLM on the level of an overly enthusiastic toddler at best.
LLMs are trained on hundreds of terabytes of data, a few petabytes at most. You are off by 3 to 6 orders of magnitude in your estimate of training data. They aren't literally trained on "all the data of the internet". That would be a divergent nightmare. Catastrophic forgetting is still a problem with neural networks and ML algorithms in general. Humans are probably trained on less than half an exabyte of data, given the ~1 Gbps of sensory data we receive over a lifetime. That's still ~20 petabytes of data by age 5. A 400B parameter LLM with 100 examples per parameter would equal about 640 TB (F16 parameters) of training data. That's the order of magnitude of current models.
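For anyone who wants to check the arithmetic in the comment above, here is a rough back-of-the-envelope sketch. The assumptions are mine, not the commenter's: sensory input runs 24/7 at a constant 1 Gbps, a lifetime is 80 years, and each training "example" is a single 16-bit value. Under that last reading, the product comes out to 640 terabits, which is 80 terabytes; the 640 TB figure holds only if each example is taken to be 16 bytes instead. Either way it lands in the tens-to-hundreds of terabytes range, not exabytes.

```python
# Back-of-the-envelope check; all constants below are illustrative assumptions.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def sensory_bytes(years, bits_per_second=1e9):
    """Bytes of sensory input over `years`, assuming a constant 24/7 bit rate."""
    return bits_per_second / 8 * years * SECONDS_PER_YEAR


def training_bits(params, examples_per_param, bits_per_example=16):
    """Total bits of training data, assuming each example is one 16-bit value."""
    return params * examples_per_param * bits_per_example


age5_pb = sensory_bytes(5) / 1e15        # petabytes by age 5
lifetime_eb = sensory_bytes(80) / 1e18   # exabytes over an assumed 80-year life
llm_bits = training_bits(400e9, 100)     # 400B parameters, 100 examples each
llm_terabits = llm_bits / 1e12
llm_terabytes = llm_bits / 8 / 1e12

print(f"sensory data by age 5: {age5_pb:.1f} PB")
print(f"sensory data in 80 years: {lifetime_eb:.2f} EB")
print(f"LLM training data: {llm_terabits:.0f} Tb = {llm_terabytes:.0f} TB")
```

The human-side numbers (roughly 20 PB by age 5, under half an exabyte over a lifetime) match the comment under these assumptions.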
Do you hypothesize that they see more wrong examples than right ones? Why is there concern about model collapse if they are reasoning and can sort it out? Why does the data even need to be scrubbed before training?