Hacker News
by jameshart 1128 days ago
The result is actually richer than ‘predicted output’ - it’s a probability distribution over all possible output.

Having richer ways to consume that probability distribution than just ‘take the most likely thing, after adding some noise’ is more conducive to using LLMs to generate output that can be further processed - in rigorous ways. Like by running it through a compiler.

Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.
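
One way to picture "consuming the distribution" rather than just taking the top token: drop the tokens a checker rejects and renormalise what is left. This is only a minimal sketch; the vocabulary, the probabilities, and the `is_allowed` rule are all made up for illustration and do not come from any real model or compiler.

```python
def constrain(probs, vocab, is_allowed):
    """Zero out tokens the checker rejects, then renormalise the rest."""
    masked = [p if is_allowed(tok) else 0.0 for p, tok in zip(probs, vocab)]
    total = sum(masked)
    if total == 0:
        raise ValueError("no allowed token has non-zero probability")
    return [m / total for m in masked]

# Hypothetical four-token vocabulary and next-token probabilities.
vocab = ["(", ")", "x", "+"]
probs = [0.1, 0.5, 0.3, 0.1]

# Hypothetical rule: nothing is open yet, so ")" would be a syntax error.
allowed = constrain(probs, vocab, lambda tok: tok != ")")
best = vocab[max(range(len(allowed)), key=allowed.__getitem__)]
```

Here the unconstrained favourite `")"` is ruled out, so the choice falls to the most likely token that the checker accepts.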

2 comments

The result is actually richer than ‘predicted output’ - it’s a probability distribution over all possible output.

-- This is, uh, false. If an LLM output a "probability distribution over all possible output", it would be producing a huge, a vast, vector each time. It doesn't. ChatGPT, GPT-3 etc. produce a string output, that's it. You can say it's following a probability distribution over output space, but just about anything that produces output does that.

Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.

-- Uh, you missed where I said "in-context predicted output". The Transformer architecture is where the LLM magic happens. It's what allows "X but in pig Latin" etc.

It's hard to get that these systems are neither "fancy autocomplete" nor AGI/something magic, but an interesting and sometimes deceptive middle ground.

ChatGPT and GPT are APIs over LLMs.

The huge vector is what the neural net outputs. ‘Sampling’ is the process whereby a token is selected.

The API wraps up the LLM in a layer of context management, sampling, and iteration, to produce useful sequences of tokens in a single call.

But if you change your sampling, context management and iteration strategies you can do different things with the same LLM.
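
A rough sketch of that split, with everything here being an assumption for illustration: `toy_model` is a hypothetical stand-in for the neural net (it just returns one score per vocabulary entry), and `generate` plays the role of the API layer that does the iteration. Swapping `greedy` for `sample` changes the behaviour without touching the model.

```python
import math
import random

VOCAB_SIZE = 5  # toy vocabulary of token ids 0..4

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(context):
    """Hypothetical stand-in for an LLM: one logit per vocabulary entry."""
    # Favour the id after the last token seen, wrapping around the vocabulary.
    last = context[-1] if context else 0
    return [3.0 if i == (last + 1) % VOCAB_SIZE else 0.0 for i in range(VOCAB_SIZE)]

def greedy(probs):
    """Sampling strategy 1: always take the most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample(probs, rng):
    """Sampling strategy 2: draw a token in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(context, steps, pick):
    """The 'API layer': call the model, pick a token, append it, repeat."""
    tokens = list(context)
    for _ in range(steps):
        probs = softmax(toy_model(tokens))  # the full vector, one entry per token
        tokens.append(pick(probs))          # collapsed to a single token here
    return tokens
```

For example, `generate([0], 4, greedy)` walks deterministically through the vocabulary, while `generate([0], 4, lambda p: sample(p, random.Random()))` can wander off the most likely path.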

Note that for any fine-tuned model (like GPT-4, where the foundation model has not been made accessible) the model no longer gives the "probabilities" of the next tokens, but rather their "goodness": the numbers say how good a token would be relative to the aims the model inferred from its fine-tuning.
Isn’t that the same thing? The non-fine-tuned models also have assumptions based on corpus and training. I don’t think there’s such a thing as a purely objective probability of the next token.
It's very different. We don't know exactly what the model considers good after fine-tuning (which can lead to surprising cases of misalignment), while the probability that something is the next token in the training distribution is very clear. I don't know how they measure it, but they can apparently measure the "loss" which (I think) says how close the model is to some sort of real probability.
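
For reference, the loss usually meant in this context is cross-entropy: the average negative log of the probability the model assigned to the token that actually came next. A minimal sketch, with made-up numbers and a hypothetical three-token vocabulary:

```python
import math

def cross_entropy(predicted_probs, actual_next_tokens):
    """Average negative log-probability given to what actually came next."""
    losses = [-math.log(probs[tok])
              for probs, tok in zip(predicted_probs, actual_next_tokens)]
    return sum(losses) / len(losses)

# Two positions in a text; each row is the model's distribution at that position.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
actual = [0, 1]  # ids of the tokens that really followed
loss = cross_entropy(predicted, actual)
```

The loss is 0 only if the model puts probability 1 on every actual next token, and it grows as the model spreads probability onto tokens that did not occur.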
What I meant was, fine tuning is not substantially different from training. It seems odd to use different words for the resulting systems.
But fine-tuning is very different from (pre)training. Pretraining proceeds via unsupervised learning on massive amounts of data and compute, while fine-tuning uses much smaller amounts, with supervised learning (instruction tuning) and reinforcement learning (RLHF, constitutional AI).
"no longer"??

The deep learning models (of which LLMs and GPTs are a type) have never returned probabilities. Ever. Why do people have that hallucination suddenly?

They do produce probabilities at the end of the generator, and they do select a single token for output: either the one with the highest probability or one chosen with some randomization.

So, end users see only one value. But with access to the internals, all the high-probability variants can be considered. The easy way to do it is to select one and save the state, look forward, roll back to the saved state, try another token, and select the best output. The smart way is to do it only at key points, where it matters the most. Selecting those points is a different task. Maybe for another model.
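
The "easy way" can be sketched roughly like this. Everything below is an assumption for illustration: `toy_next_probs` is a hypothetical lookup table standing in for a real model, the "saved state" is just a copy of the token list, and scoring by summed log-probability is one possible choice of what "best" means.

```python
import math

VOCAB = ["a", "b", "c"]

def toy_next_probs(tokens):
    """Hypothetical model: next-token probabilities keyed on the last token."""
    table = {
        None: [0.5, 0.4, 0.1],
        "a": [0.34, 0.33, 0.33],
        "b": [0.9, 0.05, 0.05],
        "c": [0.1, 0.1, 0.8],
    }
    last = tokens[-1] if tokens else None
    return table[last]

def rollout_score(tokens, depth):
    """Look forward greedily for `depth` steps; return the summed log-probability."""
    trial = list(tokens)  # work on a copy, so the caller's state is never changed
    score = 0.0
    for _ in range(depth):
        probs = toy_next_probs(trial)
        best = max(range(len(probs)), key=probs.__getitem__)
        score += math.log(probs[best])
        trial.append(VOCAB[best])
    return score

def pick_with_lookahead(tokens, top_k=2, depth=1):
    """Try each of the top_k candidates, look ahead, and keep the best one."""
    probs = toy_next_probs(tokens)
    candidates = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    best_token, best_score = None, -math.inf
    for i in candidates:
        # Each candidate starts from the same saved state (the original list).
        score = math.log(probs[i]) + rollout_score(tokens + [VOCAB[i]], depth)
        if score > best_score:
            best_token, best_score = VOCAB[i], score
    return best_token
```

In this toy table the most likely first token is `"a"`, but it leads to an uncertain continuation, so one step of lookahead prefers `"b"`; with `depth=0` the function falls back to the plain greedy choice.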

The probabilities (in the form of log probabilities) can be directly accessed in the OpenAI playground, I believe. The "try again" approach would only work for temperature > 0: at temperature = 1 the model samples tokens with the given probabilities, while at temperature = 0 it always returns the token with the highest probability. Usually they use something like temperature 0.8 in ChatGPT, I think, which still biases the model toward the more likely tokens. In the playground the temperature can be set manually. (Again, for fine-tuned models, which are the majority, those numbers are not probabilities but "goodnesses".)
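
A small sketch of what temperature does, under the usual convention (an assumption here, since the thread does not spell it out) that the logits are divided by the temperature before the softmax. With that convention, temperature 1 leaves the model's distribution unchanged, values below 1 sharpen it, and the limit at 0 puts all the weight on the single most likely token. The logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        # Limit case: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
p_one = softmax_with_temperature(logits, 1.0)   # the unmodified distribution
p_cool = softmax_with_temperature(logits, 0.8)  # sharper, favours the top token
p_zero = softmax_with_temperature(logits, 0.0)  # deterministic, top token only
```

Retrying a prompt can only give a different answer when the temperature is above 0, because at 0 the same token wins every time.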
Okay why is this downvoted? wtf
Upvoting a bit. My guess is we have anti-AI vigilantes here. Actually it's not a guess anymore, and not something new in general.
You can literally fire up the OpenAI playground and ask GPT-3 to give you all the alternate token probabilities.