by yldedly 1543 days ago
>At any rate, I don't care if it only works one in ten times

>you are asking me to disbelieve plain evidence


If you threw a thousand tries at a Markov chain, to use the classic "pure pattern matcher", it could not do any fraction of what this model does, ever, at all. You would have to throw enough tries at it that it tried every number that could possibly come next, to get a hit. So one in ten is actually really good. (If that's the rate, we have zero idea how cherrypicked their results actually are.)

And the errors that GPT makes tend to be off-by-one errors, human errors, misunderstandings, confusions. It loses the plot. But a Markov chain never even has the plot for an instant.

GPT pattern-matches at an abstract, conceptual level. If you don't understand why that is a huge deal, I can't help you.

It's a pretty big deal, and there's a big difference between a Markov chain and a deep language model - the Markov chain's performance will quickly saturate as you add data, while the language model keeps improving as the data scales.
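For anyone who hasn't built one, here is a minimal sketch of the kind of word-level Markov chain being contrasted with GPT. It is a toy bigram model, and the function names and the tiny corpus are made up for illustration. The point it shows is that each next word is chosen by looking only at the single previous word, so nothing earlier in the sentence can influence it.

```python
import random
from collections import defaultdict

def train_bigram(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    table = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table

def generate(table, start, length, seed=0):
    """Sample up to `length` words, conditioning only on the previous word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:
            break  # the last word was never followed by anything in the corpus
        out.append(rng.choice(followers))
    return " ".join(out)

# Hypothetical toy corpus, only for illustration.
corpus = "the pod of whales swam past the pod of TPUs"
table = train_bigram(corpus)
print(generate(table, "the", 8))
```

After "of", the chain picks "whales" or "TPUs" purely from counts; it has no record of which "pod" the sentence started with. Longer contexts (trigrams and up) help a little, but the table of contexts grows very quickly.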

But the way these models are talked about is misleading. They don't "answer questions", "translate", "explain jokes", or anything of that sort. They predict missing words. Since the network is so large, and the dataset has so many examples, it can scale up the method of:

1) Find a part of the network which encodes training data that is most similar to the prompt

2) Put the words from the prompt in place of the corresponding words in the encoding of the training data

i.e. pattern matching. So if it has seen a similar question to the one given in the prompt (and given that it's trained on most of the internet, it will find thousands of uncannily similar questions), it will produce a convincing answer.
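To make the two steps concrete, here is a deliberately crude sketch of that retrieve-and-substitute picture. It is a caricature of the description above, not a claim about what a transformer actually computes: the `MEMORY` list, the word-overlap similarity, and the position-by-position swap are all assumptions chosen to keep the example small.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical "memorized" question/answer pairs standing in for training data.
MEMORY = [
    ("what is the capital of france", "the capital of france is paris"),
    ("who wrote hamlet", "hamlet was written by shakespeare"),
]

def nearest(prompt):
    """Step 1: find the stored question most similar to the prompt."""
    words = prompt.lower().split()
    return max(MEMORY, key=lambda qa: jaccard(words, qa[0].split()))

def answer(prompt):
    """Step 2: reuse the stored answer, swapping in the prompt's words."""
    question, stored_answer = nearest(prompt)
    prompt_words = prompt.lower().split()
    # Words that differ position-by-position are treated as slot fillers.
    swaps = {old: new
             for old, new in zip(question.split(), prompt_words)
             if old != new}
    return " ".join(swaps.get(w, w) for w in stored_answer.split())

print(answer("what is the capital of spain"))
```

With this toy memory, the call returns "the capital of spain is paris": the sentence frame of the closest stored answer is reused and only the word that differs from the stored question is swapped.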

How is that different from a human answering questions? A human uses pattern matching as part of the process, sure. But they also use, well, all the other abilities that together make up intelligence. They connect the otherwise meaningless symbols of the sentence to the mental representations that model the world - the ones pertaining to whatever the question is about.

If I ask a librarian "What is the path integral formulation of quantum mechanics?", and they come back with a textbook and proceed to read the answer from page 345, my reaction is not "Wow, you must be a genius physicist!", it's "Wow, you sure know where to find the right book for any question!". In the same way, I'm impressed with GPT for being a nifty search engine, but then again, Google search does a pretty good job of that already.

I don't know what to tell you. They specifically showed PaLM novel jokes. You're effectively saying that the paper is either mistaken or fraudulent.

In my experience with language models, what they do cannot be reduced to madlibs. But that's obviously not an argument I can prove to you.

Can we agree that if the model can explain structurally novel jokes, then it must have some measure of true understanding?

Understanding of what? What the joke is about? Then no, it has no idea what any of it means. The syntactic structure of jokes? Sure. Feed it 10 thousand jokes that are based on a word found in two otherwise disjoint clusters (pod of whales, pod of TPUs), with a subsequent explanation. It's fair to say it understands that joke format.

If you somehow manage to invent a kind of joke never before seen in the vast training corpus, that alone would be impressive. If PaLM can then explain that joke, I will change my mind about language models, and then probably join the "NNs are magic you guys" crowd, because it wouldn't make any sense.

Good point, coming up with a novel joke is no joke. There's a genuine problem where GPT is, to a first approximation, going to have seen everything we'll think of to test it, in some form or other.

Of course, if we can't come up with something sufficiently novel to challenge it with, that also says something about the expected difficulty of its deployment. :-P

I guess once we find a more sample-efficient way to train transformers, it'll become easier to create a dataset where some entire genre of joke will be excluded.