Hacker News
by yldedly 1543 days ago
The point is that not only is it impossible to infer the structure of the world from text, deep learning is incapable of learning about or even representing the world.

The reason language makes sense to us is that it triggers the right representations. It does not make sense intrinsically, it's just a sequence of symbols.

Learning about the world requires at least causal inference, modular and compact representations such as programming languages, and much smarter learning algorithms than random search or gradient descent.

2 comments

I don't know why you think this. There is much structural regularity in a large text corpus that is descriptive of relationships in the world. Eventually the best way to predict this regularity is just to land in a portion of parameter space that encodes the structure. But again, in the limit of a maximally descriptive text corpus, the best way to model this structure is just to encode the structure of the world. You have given no reason to think this is inherently impossible.
>There is much structural regularity in a large text corpus that is descriptive of relationships in the world.

Sure, there is a lot. But let's say we want to learn what apples are. So we look at occurrences of "apple" in the text corpus, and learn that apples can be eaten, they can be sweet, sometimes they are sour, red, sometimes green, and so on.

Can apples spontaneously change size? Hmm, no idea, no mention of that in the text. Can they be used as a knife? Dunno. If I had an apple in New York 4 minutes ago, can someone else be eating the same apple in Hong Kong now? Dunno. Did apples exist in France two million years ago? Dunno. Can you drive to Saturn in an apple? Dunno.

In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships. If there were, the model could figure out the answers to the questions above by inheritance.

Maybe these particular questions happen to be answered correctly by PaLM. Maybe not, but the next LLM will include this comment in the training corpus.

But the reason GPT-3 and other models tend to make no sense is because their output is not constrained by reality. The text in the training corpus tends to conform to reality, but when you prompt the model with questions that nobody would ever write text about, the illusion falls away:

Q: How many eyes does a giraffe have? A: A giraffe has two eyes.

Q: How many eyes does my foot have? A: Your foot has two eyes.

Q: How many eyes does a spider have? A: A spider has eight eyes.

Q: How many eyes does the sun have? A: The sun has one eye.

Q: How many eyes does a blade of grass have? A: A blade of grass has one eye.

Q: How do you sporgle a morgle? A: You sporgle a morgle by using a sporgle.

Q: How many bonks are in a quoit? A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen? A: It takes two rainbows to jump from Hawaii to seventeen.

Q: Which colorless green ideas sleep furiously? A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.

Q: Do you understand these questions? A: I understand these questions.

(from https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.h...)

>In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships.

I don't know why you think language models are fundamentally unable to deduce the knowledge of the points you mention. Much knowledge isn't explicitly stated, but is implicit and can be deduced from a collection of explicit facts. For example, apples are food, food is physical matter, physical matter is fixed in size, cannot be in two places at once, maintains its current momentum unless acted on by a force, etc. Categorizing objects and deducing properties from an object's category are both within the parameter space of language models. There's no reason to think that a sufficiently large model will not land on these parameters.
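
To make the deduction chain above concrete, here is a toy sketch in Python. The category table and the names IS_A, PROPERTIES and deduce_properties are made up for illustration. It only shows what "deducing properties from an object's category" means; it makes no claim about whether a language model actually ends up encoding such a structure.

```python
# Hypothetical hand-written knowledge base (illustrative only).
# IS_A maps a thing to its parent category; PROPERTIES holds explicit facts.
IS_A = {"apple": "food", "food": "physical matter"}
PROPERTIES = {
    "apple": {"sweet or sour", "red or green"},
    "food": {"can be eaten"},
    "physical matter": {"fixed in size", "in one place at a time"},
}

def deduce_properties(thing):
    """Collect the facts stated for `thing` and for every category above it."""
    found = set()
    while thing is not None:
        found |= PROPERTIES.get(thing, set())
        thing = IS_A.get(thing)  # climb to the parent category, if there is one
    return found

apple_facts = deduce_properties("apple")
print("fixed in size" in apple_facts)
```

Nothing in the table says that apples are fixed in size; the fact is reached only by following apple -> food -> physical matter.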

>But the reason GPT-3 and other models tend to make no sense is because their output is not constrained by reality.

The issue isn't what GPT-3 can or cannot do, it's about what autoregressive language models as a class are capable of. Yes, there are massive holes in GPT-3's ability to maintain coherency across wide ranges of contexts. But GPT-3's limits do not imply a limit to autoregressive language models more generally.

>I don't know why you think language models are fundamentally unable to deduce the knowledge of the points you mention.

Because the knowledge is not there in the text, the models are not able to represent it, and as seen in the demonstration above, they don't have it.

The demonstration is irrelevant. The issue isn't what GPT-3 can or cannot do, but what this class of models can do.

Reduce knowledge to particular kinds of information. Gradient descent discovers information by finding parameters that best satisfy the training objective. Given a large enough data set that is sufficiently descriptive of the world, the "shape" of the world described by the data admits better and worse structures to predict the data. The organization and association of information that we call knowledge is a part of the parameter space of LLMs. There is no reason to think such a learning process cannot find this parameter space.
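
As a minimal sketch of "finding parameters" in this sense, the Python below runs plain gradient descent on a one-parameter model. The data, the hidden rule y = 3 * x, the learning rate and the function name fit_slope are all assumptions chosen for the demo; it is nowhere near the scale or architecture of an LLM.

```python
# Toy data produced by a hidden rule y = 3 * x (an assumption for this demo).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

def fit_slope(xs, ys, lr=0.01, steps=500):
    """Gradient descent on mean squared error for the model y = w * x."""
    w = 0.0
    for _ in range(steps):
        # Derivative of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # step against the gradient
    return w

w = fit_slope(xs, ys)
print(round(w, 3))
```

The rule is never given to the procedure; the parameter w simply drifts toward the value that predicts the data best.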

It sounds like you're arguing that GPT doesn't work because it cannot work. However, it does work.

So how does PaLM understand causal chains and explain jokes that it has never seen before?

It doesn't. It's pattern matching, and you're seeing cherry-picked examples. The pattern matching is enough to give the illusion of understanding. There are plenty of articles where more thorough testing reveals the difference. Here are two:

https://medium.com/@melaniemitchell.me/can-gpt-3-make-analog...

https://www.technologyreview.com/2020/08/22/1007539/gpt3-ope...

But you could also just try one of these models, and see for yourself. It's not exactly subtle.

GPT-3 was specifically worse at jokes, which is why PaLM being good at this so impresses me. At any rate, I don't care if it only works one in ten times. To me, this is equivalent to complaining that the dog has bad marks in high school. (PaLM could probably explain that one to you: "The speaker is complaining that the dog is only getting C's. For a human, a C is quite a bad mark. However, getting even a C is normally impossible for a dog.")

"It's pattern matching" just sounds like an excuse for why it working "doesn't really count". At this point, you are asking me to disbelieve plain evidence. I have played with these models, people I know have played with these models, I have some impression of what they're capable of. I'm not disagreeing it's "just pattern matching", whatever that means, I am asserting that "pattern matching" is Turing-complete, or rather, cognition-complete, so this is just not a relevant argument to me.

What do you think a neuron does?

>At any rate, I don't care if it only works one in ten times

>you are asking me to disbelieve plain evidence

If you threw a thousand tries at a Markov chain, to use the classic "pure pattern matcher", it could not do any fraction of what this model does, ever, at all. You would have to throw enough tries at it that it tried every number that could possibly come next, to get a hit. So one in ten is actually really good. (If that's the rate, we have zero idea how cherrypicked their results actually are.)

And the errors that GPT does make tend to be off-by-one errors, human errors, misunderstandings, confusions. It loses the plot. But a Markov chain never even has the plot for an instant.
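
For reference, this is roughly what a word-level Markov chain looks like, as a small Python sketch. The tiny corpus and the helper names build_bigram_chain and generate are made up for illustration. Each next word is drawn using only the current word, which is why such a model carries no context beyond one step.

```python
import random
from collections import defaultdict

def build_bigram_chain(text):
    """Map each word to the list of words that followed it in `text`."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    """Walk the chain; every choice depends only on the last word emitted."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:  # dead end: this word never had a successor
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "a giraffe has two eyes and a spider has eight eyes"
chain = build_bigram_chain(corpus)
print(generate(chain, "a", 8))
```

Because "has" was followed by both "two" and "eight" in the corpus, the chain is as happy to give a giraffe eight eyes as two.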

GPT pattern-matches at an abstract, conceptual level. If you don't understand why that is a huge deal, I can't help you.

It's a pretty big deal, and there's a big difference between a Markov chain and a deep language model - the Markov chain will quickly converge, while the language model can scale with the data.

But the way these models are talked about is misleading. They don't "answer questions", "translate", "explain jokes", or anything of that sort. They predict missing words. Since the network is so large, and the dataset has so many examples, it can scale up the method of 1) Find a part of the network which encodes training data that is most similar to the prompt 2) Put the words from the prompt in place of the corresponding words in the encoding of the training data

i.e. pattern matching. So if it has seen a similar question to the one given in the prompt (and given that it's trained on most of the internet, it will find thousands of uncannily similar questions), it will produce a convincing answer.
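
A deliberately crude caricature of the two steps described above, in Python. The stored question/answer pairs and the helper names tokens and answer are invented for the example, and word overlap stands in for whatever notion of similarity a real network uses. This is not how a transformer computes anything; it only shows what "find the most similar example, then swap in the prompt's words" would produce.

```python
def tokens(sentence):
    """Lowercase, drop trailing punctuation, split on whitespace."""
    return sentence.lower().strip("?.").split()

def answer(prompt, memory):
    p = tokens(prompt)
    # Step 1: pick the stored question that shares the most words with the prompt.
    question, reply = max(memory, key=lambda qa: len(set(tokens(qa[0])) & set(p)))
    q = tokens(question)
    # Step 2: wherever prompt and stored question differ at the same position,
    # substitute the prompt's word into the stored reply.
    swaps = {old: new for old, new in zip(q, p) if old != new}
    return " ".join(swaps.get(word, word) for word in tokens(reply))

memory = [
    ("How many eyes does a giraffe have?", "A giraffe has two eyes."),
    ("How many legs does a spider have?", "A spider has eight legs."),
]
print(answer("How many eyes does a foot have?", memory))
```

With this memory, the foot question is matched to the giraffe question and comes back as "a foot has two eyes", the same kind of confident nonsense as in the GPT-3 transcript quoted earlier in the thread.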

How is that different from a human answering questions? A human uses pattern matching as part of the process, sure. But they also use, well, all the other abilities that together make up intelligence. They connect the meaningless symbols of the sentence to the mental representations that model the world - the ones pertaining to whatever the question is about.

If I ask a librarian "What is the path integral formulation of quantum mechanics?", and they come back with a textbook and proceed to read the answer from page 345, my reaction is not "Wow, you must be a genius physicist!", it's "Wow, you sure know where to find the right book for any question!". In the same way, I'm impressed with GPT for being a nifty search engine, but then again, Google search does a pretty good job of that already.