by YeGoblynQueenne 1276 days ago
Here's a brief reminder of how large language models like GPT-3 work.

First, you train until the cows come home on billions of tokens scraped from all over the web. This is called "pre-training", even though it's basically all of the model's training (i.e. the setting of its parameters, a.k.a. weights).

The trained model is a huge table of tokens and their probabilities of occurring in a certain position relative to other tokens in the table. It is, in other words, a probability distribution over token collocations in the training set.
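To make the "table" picture concrete, here's a toy sketch that takes it literally. It assumes a single token of context (a bigram table) and plain counting, which is a drastic simplification: a real model conditions on much longer contexts and encodes the distribution in its weights rather than storing rows explicitly. The function name `build_bigram_table` and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Estimate P(next | previous) from raw counts in a list of tokens."""
    counts = defaultdict(Counter)
    # Count how often each token follows each other token.
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    table = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        table[prev] = {tok: n / total for tok, n in followers.items()}
    return table

# Hypothetical six-token "training set".
corpus = "the cat sat on the mat".split()
table = build_bigram_table(corpus)
print(table["the"])
```

Each row of `table` is a distribution over what comes next, given what came before. That's the whole "model" in this sketch.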

Given this trained model, a user can then give a sequence as an input to the model. This input is called a "prompt".

Given the input prompt, the model can be searched (by an outside process that is not part of the model itself) for a token with maximal probability conditioned on the prompt [1]. Semi-formally, that means, given a sequence of tokens t₁, ..., tₙ, finding a token tₙ₊₁ such that the conditional probability of the token, given the sequence, i.e. P(tₙ₊₁|t₁, ..., tₙ), is maximised.
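That search step, sketched in a few lines. The distribution below is invented for illustration; in a real system it would come out of the model for a given prompt, and `most_probable_next` is a hypothetical helper name, not anyone's actual API.

```python
def most_probable_next(next_token_probs):
    """Return the token t that maximises P(t | t_1, ..., t_n).

    `next_token_probs` maps each candidate token to its conditional
    probability given the prompt.
    """
    return max(next_token_probs, key=next_token_probs.get)

# Made-up conditional distribution for the prompt "the cat sat on the".
probs = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}
chosen = most_probable_next(probs)
print(chosen)
```

Note that the picking happens outside the distribution itself: the function just reads numbers off it.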

Once a token that maximises that conditional probability is found, it is appended to the sequence and... the system searches for another token.

And another.

And another.

This process typically stops when the sampling generates an end-of-sequence token [2]. That token is a magic marker tautologically saying, essentially, "Here be the end of a <token sequence>". It is not the same as an end-of-line, end-of-paragraph etc. token; what it looks like depends on the tokenisation procedure used before training to massage the training set into something trainable-on.

Once the process stops, the sampling procedure spits out the sequence of tokens starting at tₙ₊₁.
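The whole procedure fits in a short loop. This is a minimal sketch under stated assumptions: `TOY_TABLE` and `next_token_distribution` are hypothetical stand-ins for the trained model, the context is cut down to the last token to keep the toy small, the `"<eos>"` string is an arbitrary choice of marker, and the `max_new_tokens` cap is my addition so the loop always terminates.

```python
EOS = "<eos>"  # arbitrary end-of-sequence marker for this toy

# Hypothetical stand-in for the trained model: P(next | last token).
TOY_TABLE = {
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.9, EOS: 0.1},
    "sat": {EOS: 0.8, "down": 0.2},
}

def next_token_distribution(context):
    """Look up the distribution over the next token (toy: last token only)."""
    return TOY_TABLE.get(context[-1], {EOS: 1.0})

def generate(prompt, max_new_tokens=20):
    """Repeatedly pick the most probable next token until EOS or the cap."""
    sequence = list(prompt)
    generated = []
    for _ in range(max_new_tokens):
        probs = next_token_distribution(sequence)
        token = max(probs, key=probs.get)  # the search is outside the model
        if token == EOS:
            break
        generated.append(token)
        sequence.append(token)  # condition the next search on the new token
    return generated  # the tokens starting at t_{n+1}, EOS dropped

print(generate(["the"]))
```

The loop, the stopping test, and the decision to drop the marker all live in `generate`, not in the table it reads from.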

Now, can you say where in all this is the "actual reasoning" you are concerned people are still claiming is not there?

____________

[1] This used to be called "sampling from the model's probability distribution". Nowadays it's called "Magick fairy dust learning with unicorn feelies" or something like that. I forget the exact term but you get the gist.

[2] Btw, this half-answers your question. Language models on their own can't even tell that a sentence is finished. What reasoning?