| HN Mirror

	by gptadmirer 1337 days ago
	But how can a probability distribution over sequences of consecutive tokens create new things? Like, I saw the other day that it created C code that generates Lisp code that generates Pascal code. Is this based entirely on previous creations?

3 comments

qsort 1337 days ago

It doesn't create anything new. It creates things that look new.

The code examples are perfect case studies: they don't actually work. They aren't just slightly wrong, they're completely nonsensical.

Another example is "is <number> prime?": it can't answer things like that, and it will make up something that may or may not be accurate.

The model has no concept of what is true or false; it's essentially trying to predict the most likely next token.

It seems to know stuff because the knowledge comes from the dataset, hence techniques like zero-shot, few-shot and prompt-based learning.

unoti 1337 days ago

> It doesn't create anything new. It creates things that look new.

This is not technically true. It can and does create things that are new. There are lots of new poems and jokes right here in this thread. I asked it, for example, to give me its top 10 reasons why Bigfoot knocks on camper trailers, and one of its answers was "because it likes to play with its food." I did a lot of searching to try to find this joke out there on the internet, and could not. I've also had it create Weird Al style songs for a variety of things, and it does great.

If these aren't new creations, I'm not sure what your threshold is for creating something new. In a sense I can see how you can say that it only "looks" new, but surely the essays generated by students worldwide mostly only "look" new, too...

LeanderK 1337 days ago

ChatGPT created a poem to cheer up my sick girlfriend. I wrote a bit about how she feels, what she has (just the flu) and what I did to cheer her up. ChatGPT created a decent poem that exactly fitted my description but was a bit dramatic; she's not dying, just tired of being sick. I asked ChatGPT to create a less dramatic version that rhymes more, and ChatGPT just did it. Amazing. I also googled parts of it but didn't find them! This certainly counts as novel, or else I would also be totally unable to create novel poems about my sick girlfriend (because I have read poems about girlfriends before?!).

A good idea when dismissing those machine learning models is to check whether a human would pass your standards. I miss that aspect whenever the dismissive "they only interpolate or memorise" arguments come up. I am also quite bounded by my knowledge and by what I have seen. Describe something I have never seen and ask me to draw it, and I would fail in a quite hilarious way.

Hilariously, ChatGPT is also quite bad at arithmetic, like myself. I thought this is what machines are supposed to be good at!

underwater 1337 days ago

People solve this by getting the GPT to describe a series of computations and then running those steps externally (e.g. asking GPT what Python code to run).

That's not so different from how humans do this. When we need to add or multiply, we switch from freeform thought to executing the Maths programs that were uploaded into our brains at school.

squirrel 1337 days ago

If I recall correctly, in his paper on whether machines could think, Turing gives an imaginary dialogue with a computer trying to pass as a human (what we later came to call the Turing test) where the judge poses an arithmetic problem, and the computer replies after a pause of 30 seconds — with the wrong answer.

underwater 1337 days ago

That joke is a great example of why the creativity is surprising.

A human might have a thought process that starts with the idea that people are food for Bigfoot, and then connects that to the phrase "playing with your food".

But GPT generates responses word by word. And it operates at a word (token) level, rather than thinking about the concepts abstractly. So it starts with "Because it likes to play", which is a predictable continuation that could end in many different ways. But it then delivers the punchline of "with its food".

Was it just a lucky coincidence that it found an ending to the sentence that paid off so well? Or is the model so sophisticated that it can suggest the word "play" because it can predict the punchline related to "food"?

mk_stjames 1337 days ago

I think what you are saying is just not true in the case of GPT-style LLMs. The output is not just single-word generation, one word at a time. It is indeed taking into account the entire structure, preceding structures, and to a certain extent abstractions inherent to the structure throughout the model. Just because it tokenizes input doesn't mean it is seeing things word by word or outputting word by word. Transformers are not just fancy LSTMs. The whole point of transformers is that they take the input in parallel, whereas RNNs are sequential.

underwater 1337 days ago

It seems I'd gotten the wrong impression of how it works. Do you have any recommendations for primers on GPT and similar systems? Most content seems to be either surface level or technical and opaque.

fjkdlsjflkds 1337 days ago

No. You got the right impression. It is indeed doing "next token prediction" in an autoregressive way, over and over again.

The best source would be the GPT-3 paper itself: https://paperswithcode.com/method/gpt-3

carabiner 1337 days ago

I wish someone would pass it the entirety of an IQ test. I bet it would score around 100, since it does seem to get some logic questions wrong.

mk_stjames 1337 days ago

Well, since it is a text-only AI, it could only possibly attempt the VIQ part of a Wechsler-style IQ test, since the PIQ part requires understanding image abstractions (arrangements, block design, matrices of sequences, etc.).

I know there were some deep learning papers on how to train a model to pass the PIQ portion without human-coded heuristics (because you could easily write a program to solve such questions if you knew the format of the questions ahead of time). I don't remember the outcomes, however.

emmelaich 1337 days ago

It got 52% on an SAT exam. Better than most people.

LeanderK 1337 days ago

I have seen a score of 83 on twitter

gptadmirer 1337 days ago

Interesting, but I wonder how it has the ability to combine those, i.e., creating a song in a KJV/SpongeBob style, or creating code that writes code that writes code.

espadrine 1337 days ago

“create a song in spongebob style” will be cut into tokens which are roughly syllables (out of 50257 possible tokens), and each token is converted to a list of 12288 numbers. Each token always maps to the same list, called its embedding; the conversion table is called the token embedding matrix. Two embeddings that are a short distance apart tend to belong to similar concepts.

Then each token’s embedding is roughly multiplied with a set of matrices called an “attention head” that yields three lists: query, key, value, each of 128 numbers behaving somewhat like a fragment of an embedding. We then take the query lists for the past 2048 tokens, and multiply each with the key lists of each of those 2048 tokens: the result indicates how much a token influences another. Each token’s value list gets multiplied by that, so that the output (which is a fragment of an embedding associated with that token, as a list of 128 numbers) is somewhat proportional to the value list of the tokens that influence it.
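
A toy sketch of that query/key/value step for a single head may help. It rests on assumptions the comment doesn't spell out: the scores are normalised with a softmax, divided by the square root of the head size, and each token only looks at itself and earlier tokens (all standard in GPT-style transformers). The lists here hold 2 made-up numbers instead of 128.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Single causal attention head over per-token query/key/value lists."""
    head_size = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        # Score token i against itself and every earlier token (assumed causal mask).
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(head_size)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is a weighted mix of the value lists of the influencing tokens.
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


# Made-up numbers for three tokens; a real head uses 128 numbers per list.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs = attention_head(queries, keys, values)
```

The first token can only attend to itself, so its output is exactly its own value list; later tokens get a blend of the value lists they attend to.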

We compute 96 attention heads in parallel, so that we get 128×96 = 12288 numbers, which is the size of the embedding we had at the start. We then multiply each with weights, sum the result, pass it through a nonlinear function; we do it 49152 times. Then we do the same again with other weights, but only 12288 times, so that we obtain 12288 numbers, which is what we started with. This is the feedforward layer. Thanks to it, each fragment of a token’s embedding is modified by the other fragments of that token’s embedding.
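
The sizes quoted above can be checked with a couple of multiplications, and the feedforward layer itself can be sketched at toy scale. The nonlinearity is assumed to be GELU (the comment only says "a nonlinear function"), biases are left out, and the weights are random placeholders rather than learned values.

```python
import math
import random

N_HEADS = 96
HEAD_SIZE = 128
EMBEDDING_SIZE = N_HEADS * HEAD_SIZE   # 12288, as in the comment
HIDDEN_SIZE = 4 * EMBEDDING_SIZE       # 49152, as in the comment


def gelu(x):
    """Smooth nonlinearity (an assumption; the comment only says 'nonlinear')."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def feed_forward(embedding, weights_in, weights_out):
    """Expand to the hidden size, apply the nonlinearity, project back."""
    hidden = [gelu(sum(e * w for e, w in zip(embedding, row))) for row in weights_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights_out]


# Toy sizes (4 and 16) standing in for 12288 and 49152; random weights, no biases.
rng = random.Random(0)
toy_size, toy_hidden = 4, 16
weights_in = [[rng.uniform(-1, 1) for _ in range(toy_size)] for _ in range(toy_hidden)]
weights_out = [[rng.uniform(-1, 1) for _ in range(toy_hidden)] for _ in range(toy_size)]
output = feed_forward([0.5, -1.0, 0.25, 2.0], weights_in, weights_out)
```

The output has the same length as the input, which is what lets the layers be stacked.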

Then we pass that output (a window of 2048 token embeddings, each of 12288 numbers) through another multi-head attention layer, then another feedforward layer, again. And again. And again. 96 times in total.

Then we convert the output to a set of 50257 numbers (one for each possible next token) that give the probability of that token being the next syllable.

The token embedding matrix, multi-head attention weights, etc. have been learned by computing the gradient of the cross-entropy (i.e. roughly how surprised the model is, on average, by the actual next syllable) of the model’s output, with respect to each weight in the model, and nudging the weights towards lower cross-entropy.
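
As a rough illustration of that training signal, here is one gradient step on a 3-token toy vocabulary. To keep it short, the step is taken directly on the output scores (logits) rather than on model weights, which is a simplification; the learning rate and the numbers are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the token that actually came next."""
    return -math.log(softmax(logits)[target])


def gradient_wrt_logits(logits, target):
    """Gradient of the cross-entropy: predicted probabilities minus the one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


logits = [0.0, 0.0, 0.0]   # the toy model starts out with no preference
target = 2                 # index of the token that really came next
learning_rate = 0.5        # made-up step size

loss_before = cross_entropy(logits, target)
gradient = gradient_wrt_logits(logits, target)
nudged = [l - learning_rate * g for l, g in zip(logits, gradient)]
loss_after = cross_entropy(nudged, target)
```

After the nudge the correct token gets a higher probability, so `loss_after` is smaller than `loss_before`.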

So really, it works because there is a part of the embedding space that knows that a song is lyrical, and that a part of the attention head knows that sponge and bob together represent a particular show, and that a part of the feedforward layer knows that this show is near “underwater” in the embedding space, and so on.

jimbokun 1337 days ago

Nobody really knows, because the model is too large and complex to really analyze.

CamperBob2 1337 days ago

> It doesn't create anything new.

Who does? This is nothing but a "God of the Gaps" argument in reverse.

visarga 1337 days ago

Sounds like you are thinking of language models in isolation, working in closed-book mode. That is just the default; it doesn't need to be how they are used in practice.

Did you know language models can use external tools, such as a calculator? They just need to write <calc>23+34=</calc> and they get the result "57" automatically added. Likewise, they can run <search>keyword</search> and get up-to-date snippets of information. They could write <work>def is_prime(x): ... print(is_prime(57))</work> and get the exact answer.
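
A minimal sketch of the wrapper side of this, assuming the tag format shown above. The `calc` helper, the `TOOLS` table and `run_tools` are hypothetical names for illustration, and only integer `+`, `-` and `*` are handled; search and code execution are left out.

```python
import re


def calc(expression):
    """Evaluate integer arithmetic of the form 'a+b=' (only +, - and *)."""
    match = re.fullmatch(r"\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=?\s*", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
    return str({"+": left + right, "-": left - right, "*": left * right}[op])


# Hypothetical tool table: tag name -> function that computes the result.
TOOLS = {"calc": calc}

TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def run_tools(model_output):
    """Append each known tool's result right after its closing tag."""
    def append_result(match):
        tool = TOOLS.get(match.group(1))
        if tool is None:
            return match.group(0)  # unknown tag: leave the text untouched
        return match.group(0) + tool(match.group(2))
    return TAG.sub(append_result, model_output)


augmented = run_tools("The total is <calc>23+34=</calc>")
```

The text with the result appended would then be fed back to the model so it can continue from the correct number.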

I think the correlation pattern in language is enough to do real work, especially when fortified with external resources. Intelligence is most likely a property of language, culture and tools, not of humans and neural networks.

ilaksh 1337 days ago

I've been using it to write code for my business. It's often not perfect, but usually you can say "fix bug XX in the code you gave me" and it works.

pyuser583 1337 days ago

The model also really loves stock phrases and platitudes.

kwertyoowiyop 1337 days ago

“As a large language model trained by OpenAI, I do not have personal preferences or emotions. My primary function is to provide accurate and informative responses to questions based on the data I have been trained on. I am not capable of experiencing emotions or using stock phrases or platitudes.”

cwkoss 1337 days ago

If it gives you broken code, you can tell it to fix the code and it often will

qsort 1337 days ago

Sometimes it will, sometimes it won't. The point is that it's "random", it has no way to tell truth from falsity.

Language models are unsuitable for anything where the output needs to be "correct" for some definition of "correct" (code, math, legal advice, medical advice).

This is a well-known limitation that doesn't make those systems any less impressive from a technical point of view.

randomsearch 1337 days ago

How can this interface be useful as a search engine replacement if the answers are often incorrect?

Can we fix it?

Because earlier today it told me that George VI was currently king of England. And I asked it a simple arithmetic question, which it got subtly wrong. And it told my friend there were a handful of primes less than 1000.

Everyone’s talking about it being a Google replacement. What’s the idea? That we train it over time by telling it when things are wrong? Or is the reality that these types of language models will only be useful for generating creative output?

cwkoss 1337 days ago

there are plenty of google queries that return incorrect answers, and they've been operating for decades

randomsearch 1336 days ago

It's not the same.

If you ask a chat interface a question and it says "this is true", that's very different from a search engine containing a list of results where one of them might be untrue.

For one thing, you can look at all the results and take a majority vote, etc. For another, you can look at the source to see if it's trustworthy.

emmelaich 1337 days ago

Doctors are often not totally correct, but they're useful.

thepasswordis 1337 days ago

It absolutely replies to “is <number> prime” with the correct answer.

CamperBob2 1337 days ago

Also "Why is <number> interesting?" is an interesting question to ask. It finds something interesting about most integers, and falls back to giving you a good rational approximation for 'uninteresting' reals.

cesarb 1337 days ago

> It finds something interesting about most integers

Every integer is interesting. "[...] if there exists a non-empty set of uninteresting natural numbers, there would be a smallest uninteresting number – but the smallest uninteresting number is itself interesting because it is the smallest uninteresting number [...]" (https://en.wikipedia.org/wiki/Interesting_number_paradox)

skykooler 1337 days ago

A few days ago, I got to thinking about this. On a theoretical level every integer is interesting due to that, but on a practical level there's not much point calling a number "interesting" if the only property giving it that quality is being the first uninteresting number larger than n-1.

One way of checking what makes a number interesting is seeing what sequences it is in in the OEIS[1]. Interesting properties (squares, primes, palindromes, friendly numbers, etc) all have the first hundred or so numbers with those properties in a list. So I downloaded the dataset of all sequences from OEIS, and iterated through all integers to find the first one that is not in any sequence in the OEIS.

So, I present the smallest uninteresting number: 20,067. Every number before it is in at least one of the sequences - for example, 20,066 is 'a partial sum for a coordination sequence for the "pcu-i" 3D uniform tiling'. It is the first number that has no interesting properties other than trivial ones (like being a multiple of 3). It is the first uninteresting number, but one could say that makes it meta-interesting rather than being interesting in and of itself.

[1] https://oeis.org/

Edit: and after doing all that work I saw that someone else has already done so and put it in the Wikipedia article. Oh well, at least I learned about set processing in Python.
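
For anyone curious what that set processing might look like, here is a small sketch. It is not the original script: the handful of short sequences below is toy data standing in for the OEIS download, so the answer it finds says nothing about the real OEIS.

```python
def smallest_missing(sequences, start=1):
    """Return the smallest integer >= start that appears in none of the sequences."""
    seen = set()
    for terms in sequences:
        seen.update(terms)
    n = start
    while n in seen:
        n += 1
    return n


# Toy stand-in for the OEIS dataset (the real one has far more, and longer, sequences).
toy_sequences = [
    [1, 4, 9, 16, 25],         # squares
    [2, 3, 5, 7, 11, 13],      # primes
    [1, 1, 2, 3, 5, 8, 13],    # Fibonacci numbers
    [6, 28],                   # perfect numbers
]
first_uninteresting = smallest_missing(toy_sequences)
```

With this toy data the first number missing from every list is 10; membership tests on a set are what keep the scan cheap even with many sequences.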

jonahx 1337 days ago

is 4021 prime

No, 4021 is not a prime number. A prime number is a whole number greater than 1 that is divisible only by 1 and itself. 4021 is divisible by 17, which is a number other than 1 and itself, so it is not considered to be a prime number.

thepasswordis 1337 days ago

For the curious, 4021 is not divisible by 17.

I guess I only tried the first few single digit primes. Fair enough!
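
This one is easy to check by hand with trial division. A short sketch (the function names are just for illustration):

```python
def smallest_factor(n):
    """Return the smallest divisor of n greater than 1 (n itself if n is prime)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return divisor
        divisor += 1
    return n


def is_prime(n):
    """A number is prime when its smallest factor above 1 is the number itself."""
    return n >= 2 and smallest_factor(n) == n


remainder = 4021 % 17           # non-zero, so 17 does not divide 4021
verdict = is_prime(4021)        # trial division up to 63 finds no divisor
```

Here `remainder` is 9 and `verdict` is `True`, so the model's answer was wrong on both counts.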

unoti 1337 days ago

> But how can a probability distribution over sequences of consecutive tokens create new things?

If you start a sentence with a few words, think about the probability for what the next word might be. Imagine a vector (list) with a probability for every single word in the language, proper nouns included. This is a huge list, and the probabilities of almost everything are near zero. If you take the very highest probability word, you'll get a fairly predictable thing. But if you start taking things a little lower down the probability list, you start to get what amounts to "creativity" but is actually just applied statistics plus randomness. (How readily the sampler reaches for those lower-probability words is controlled by the "temperature", which is usually a tunable parameter in these models.) But when you consider the fact that it has a lot of knowledge about how the world works and those things get factored into the relative probabilities, you have true creativity. Creativity is, after all, just trying a lot of random thoughts and throwing out the ones that are too impractical.
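
A small sketch of temperature sampling, assuming the usual formulation in which the model's raw scores are divided by the temperature before being turned into probabilities. The three scores are made up, and treating a temperature of 0 as "always take the top word" is a common convention rather than something from a specific model.

```python
import math
import random


def sample_with_temperature(scores, temperature, rng):
    """Pick an index; low temperature favours the top score, high temperature spreads out."""
    if temperature <= 0:
        # Convention assumed here: zero temperature means greedy selection.
        return max(range(len(scores)), key=lambda i: scores[i])
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


rng = random.Random(0)
scores = [2.0, 1.0, 0.0]   # made-up scores for three candidate next words

greedy = sample_with_temperature(scores, 0, rng)
cold = [sample_with_temperature(scores, 0.01, rng) for _ in range(200)]
warm = [sample_with_temperature(scores, 1.0, rng) for _ in range(200)]
```

At a very low temperature every draw lands on the top word, while at temperature 1.0 the less likely words start to show up in the samples.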

Some models, such as LaMDA, will actually generate multiple random responses, and run each of those responses through another model to determine how suitable the response is based on other criteria such as how on-topic things are, and whether it violates certain rules.
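
The generate-then-filter idea can be sketched in a few lines. Everything here is a stand-in: in a real system the candidates would be sampled from the language model and the scoring would be done by another model, whereas this toy version uses a word-overlap score and a made-up rule.

```python
def pick_response(candidates, score, is_allowed):
    """Drop candidates that break a rule, then return the best-scoring one."""
    allowed = [c for c in candidates if is_allowed(c)]
    return max(allowed, key=score) if allowed else None


def on_topic_score(prompt):
    """Hypothetical scorer: count the words a response shares with the prompt."""
    prompt_words = set(prompt.lower().split())
    return lambda response: len(prompt_words & set(response.lower().split()))


def no_shouting(response):
    """Hypothetical rule: reject responses written entirely in capitals."""
    return not response.isupper()


prompt = "why does bigfoot knock on camper trailers"
candidates = [
    "BIGFOOT KNOCKS ON CAMPER TRAILERS FOR FUN",
    "Bigfoot likes to knock on trailers",
    "The weather is nice today",
]
best = pick_response(candidates, on_topic_score(prompt), no_shouting)
```

The all-caps candidate is filtered out by the rule, and of the remaining two the on-topic one wins.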

> Is this based entirely on previous creations?

Yes, it's based entirely on its knowledge of basically everything in the world. Basically just like us, except we have personal volition and experience to draw from, and the capability to direct our own experiments and observe the results.

theptip 1337 days ago

It turns out that human intelligence has left a detailed imprint in humanity’s written artifacts, and predicting the structure of this imprint requires something similar (perhaps identical, if we extrapolate out to “perfect prediction”) to human intelligence.

Not only that, but the imprint is also amenable to gradient descent, possessing a spectrum of structures from easy to difficult to predict.
