by berndi 961 days ago
You’re confused about what “statistical parrot” means and you don’t seem to understand the difference between an optimization objective and the resulting model.

The term “parrot” is used to imply inference by something akin to a look-up table, specifically it is used to indicate poor out-of-sample performance and a lack of a proper world model. The optimization objective is irrelevant when determining the generalization performance of a model and when judging whether it can reason beyond looking up answers in a table.

As the user above noted, it is now quite well established that GPT-4 has impressive out-of-sample performance which can be explained by it possessing an actual model of the world and not being a “parrot”.


> it is now quite well established that GPT-4 has impressive out-of-sample performance

Err... I can show this is false, kinda trivially. People who engage in prompt-confirmation-bias aren't aware of what the in-sample is.

It's basically everything ever digitised: you can ask it for anything from the first paragraph of every Dickens novel to the average petal length of an iris flower -- etc.

How are you measuring the in-sample here?

If you engage in straightforward reasoning from first principles, and are basically aware of what the training data is, you can show critical failures of generalisation in 10 seconds.

If you want a recipe: go find some fringe API docs. Establish that it has been trained on them. Then, since they're fringe, there won't be much code on GitHub, etc. Now ask it to do something non-trivial with that API. It will fail, and the mechanism will be obvious: it'll jam in correlated code that lacks relevance.

Do the same on a popular API, and see it succeed.

The in-sample will be obvious for both, and so will the boundary of generalisation.

You can make it invent a new language: https://maximumeffort.substack.com/p/i-taught-chatgpt-to-inv...

I am sure you will continue to argue that this is still in line with everything-that's-ever-written prediction, but my opinion is that at that point, it's a meaningless distinction. The human brain is also just a machine.

So I was with a financial researcher recently, and he wanted to use ChatGPT to summarise some reference financial data -- and it did so, actually correctly.

Being sceptical, as every person ought to be in these matters, I changed the financial data and performed the same analysis (both in a new tab, and within the same convo). The results were the same!

How strange?

Well, since this was reference financial data, ChatGPT was reporting prior reference summaries of it. When that data was changed, it was still reporting the very same reference summaries (which were now wrong).
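
If you want to repeat that check yourself, here is a minimal sketch. Everything in it is hypothetical: `summarise` stands for whatever system you are testing, and the two stand-in functions only illustrate the two behaviours being contrasted. Note that a summary which changes is not thereby correct; an unchanged summary on changed data is just the obvious red flag.

```python
from typing import Callable, Sequence


def responds_to_change(
    summarise: Callable[[Sequence[float]], str],
    original: Sequence[float],
    perturbed: Sequence[float],
) -> bool:
    """Return True if the summary changes when the input data changes."""
    return summarise(original) != summarise(perturbed)


# Two hypothetical stand-ins for the system under test.
def computed_summary(data: Sequence[float]) -> str:
    # Actually derives its answer from the data it was given.
    return f"mean={sum(data) / len(data):.2f}"


def recited_summary(data: Sequence[float]) -> str:
    # Ignores its input and repeats a remembered answer.
    return "mean=2.00"


original = [1.0, 2.0, 3.0]
perturbed = [10.0, 20.0, 30.0]

print(responds_to_change(computed_summary, original, perturbed))
print(responds_to_change(recited_summary, original, perturbed))
```

Both stand-ins give the same answer on the original data, which is exactly why the unperturbed run tells you nothing.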

Since it's incapable of actually summarising financial data. It's only capable of selecting combinations of pieces of its training set.

Now, is this distinction "meaningless" ?

No, it's the difference between this guy being fired for causing a massive loss on a major project; and this guy keeping his job and doing it well.

>Since it's incapable of actually summarising financial data. It's only capable of selecting combinations of pieces of its training set.

Third completely off misconception from you today.

This is not at all what it is doing. "Supercharged Interpolation" is false and makes no sense. It's not a lookup table either. It doesn't memorize enough of what it would need to for your assertion to be possible.

https://arxiv.org/abs/2110.09485

at 500gb, you can store nearly everything ever written -- let alone compressed.

all statistical learning is a variation on k-nn (see the relevant paper on this) but likewise this is obvious a priori

k-nn is the ideal learner, and a good starting point for analysis

the question for any given system is: what is the learning space, what is the distance function, and how many points are being considered

NNs set up a compressed X,y space, in that space choose points via an empirical expectation, and obtain a weighted average as their prediction

That's just what they do -- there isn't any other mechanism here. The whole formal structure of the NN can be written down on a page of paper
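
For reference, the k-nn framing above (a space, a distance function, a number of points, a weighted average) can be sketched in a few lines. This is plain distance-weighted k-nn regression with Euclidean distance, which is my choice for illustration; it is not a claim about what any particular NN computes internally.

```python
import math


def knn_predict(points, query, k=3):
    """Distance-weighted k-nn regression.

    points: list of (x_vector, y) pairs; query: an x_vector.
    """
    # Pick the k stored points closest to the query.
    nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]
    # Closer points get larger weights; the epsilon avoids division by zero.
    weights = [1.0 / (math.dist(x, query) + 1e-9) for x, _ in nearest]
    # The prediction is a weighted average of the neighbours' labels.
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


points = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((10.0,), 10.0)]
print(knn_predict(points, (1.0,), k=3))
print(knn_predict(points, (100.0,), k=2))
```

Because the output is a weighted average of stored labels, the second query (far outside the data) still yields a value between the labels of its two nearest neighbours.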

your paper above doesn't deal with this -- it's a reply to the 'forced interpolation' view, which i haven't espoused. but often NNs are forced to interpolate

'extrapolation' is of course a part of the possible predictive output of a statistical learning system -- in that its latent space is taken to be embedded in R^n and so one can 'veer off' into R.

Whenever you attribute a higher fidelity space to a small latent space you are, in effect, extrapolating

>at 500gb, you can store nearly everything ever written -- let alone compressed.

No you cannot.

>That's just what they do -- there isn't any other mechanism here.

That's not what they do. There are many papers now showing ICL demonstrating some kind of optimization method during inference, which would not be happening if all they did was retrieval.

I've come to realize you don't know what you're talking about. Your level of denial is scary to see.

>Since it's incapable of actually summarising financial data

It's not, though. It is in fact able to summarize financial data, just as it's able to write code and diagnose a medical condition. It makes mistakes, yes, even grave ones, much more so than experts in those fields would.

It isn't making mistakes ... it's never actually doing it.

Do you see a difference between the process of adding numbers and dividing by their count (taking a mean) and emitting numeric tokens which are most probable for a given input?

The former is called "taking a mean"; the latter isn't. This system never engages in any method to summarise financial data. Its method is always the same: to emit tokens most probable given a set of historical tokens.

It's the difference between saying "the average of 1,2,3" is 2 because that sentence occurs 1,000,000 times and saying it's 2 because you've literally computed it.
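
To make that contrast concrete, here is a toy sketch. The `MEMORISED` table and both functions are hypothetical illustrations of the two behaviours being described, not a model of how any real system is implemented.

```python
from statistics import mean

# Hypothetical "memorised" answers, keyed by the exact prompt text.
MEMORISED = {"the average of 1,2,3": "2"}


def recite(prompt: str) -> str:
    """Return a stored answer; knows nothing about the numbers themselves."""
    # Falls back to the most familiar answer when the prompt is unseen.
    return MEMORISED.get(prompt, "2")


def compute(prompt: str) -> str:
    """Parse the numbers out of the prompt and actually take their mean."""
    numbers = [float(n) for n in prompt.rsplit(" ", 1)[1].split(",")]
    return f"{mean(numbers):g}"


seen = "the average of 1,2,3"
unseen = "the average of 4,5,9"
print(recite(seen), compute(seen))
print(recite(unseen), compute(unseen))
```

On the familiar prompt the two are indistinguishable; only the unfamiliar one separates them.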

This system does not run financial summary algorithms. It's a trick.

To add to your point: try asking ChatGPT to do basic arithmetic on numbers it hasn't seen before. You'll see just how good it is at computation.
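
A rough way to run that test, as a sketch only: generate operands large enough that the exact problem is unlikely to have been written down anywhere, then score the replies. `ask` is a hypothetical callable (prompt string in, answer string out) standing in for whatever model you query; the 12-digit default and the prompt wording are arbitrary choices.

```python
import random


def make_problems(n, digits=12, seed=0):
    """Generate n pairs of random integers with the given number of digits."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]


def accuracy(ask, problems):
    """Fraction of multiplication problems answered exactly.

    ask is a hypothetical callable: prompt string -> answer string.
    """
    correct = 0
    for a, b in problems:
        reply = ask(f"What is {a} * {b}?")
        correct += reply.strip() == str(a * b)
    return correct / len(problems)


problems = make_problems(5)
```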
The brain is a machine, the issue is the difference between 2 claims

LLMs are enough to be a brain

LLMs are not enough to be a brain.

But “everything ever digitised” includes a tonne of linguistics information - it’s still in sample.
That out of sample performance is a mirage.

Yes it’s impressive. Yes it’s got amazing zero shot performance in domains.

But there’s a pattern of failure in production which describes a limit that shouldn’t exist if the emergent properties were stable.

You can build this right now and test it.

Build a sequence of agents to work on a domain you are not an expert in.

Let them loose. See what happens.

Do the same thing on a domain you have expertise in.

Assume the number of errors you find and the number of modifications you have to make are stable across other domains.
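
If you do run this, it helps to put an interval around the error rate you measure in your own domain before carrying it over to others. A minimal sketch using the Wilson score interval; the counts in the example (5 errors in 50 reviewed outputs) are made up for illustration.

```python
import math


def wilson_interval(errors: int, trials: int, z: float = 1.96):
    """Wilson score interval for an error rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: 5 errors found in 50 reviewed outputs.
low, high = wilson_interval(5, 50)
print(f"error rate is plausibly between {low:.1%} and {high:.1%}")
```

With small samples the interval is wide, which is worth knowing before assuming the same rate holds in a domain where you cannot check the output.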

I'd put it this way: characterizing the reliability of out-of-sample performance a priori is impossible, but that doesn't mean out-of-sample performance automatically fails.

There may be a subtle correlation between properties needed to answer a specific out-of-sample request and in-sample features.

Unfortunately, prior to training/testing and without recognizing that correlation in the data set, I believe it's impossible to guarantee the model will include it. (Corrections welcome)

In essence: “You can’t know in advance how far the model can approximate semantic patterns”

So claiming that out-of-sample performance is a mirage would be a bridge too far?

Maybe "a mirage that might actually be true"? Which is a terrible thing to rely on! Unless it's usually true?
That measurement is the core of my current tasks. If you don’t know the error rate, then what are you doing?
Delivering what some executive promised when they told investors 'the company is using AI.' /s