| HN Mirror


by gushogg-blake 60 days ago

I haven't found an explanation yet that answers a couple of seemingly basic questions about LLMs:

What does the input side of the neural network look like? Is it enough bits to represent N tokens, where N is the context size? How does it handle inputs that are shorter than the context size?

I think embedding is one of the more interesting concepts behind LLMs but most pages treat it as a side note. How does embedding treat tokens that can have vastly different meanings in different contexts - if the word "bank" were a single token, for example, how does embedding account for the fact that it can mean river bank or money bank? Do the elements of the vector point in both directions? And how exactly does embedding interact with the training and inference processes - does inference generate updated embeddings at any point or are they fixed at training time?

(Training vs inference time is another thing explanations are usually frustratingly vague on.)

2 comments

maciejzj 60 days ago

AFAIK, the input is (at the most basic level) a matrix with L rows, one per token, and d columns, where d is the embedding length. The input tokens are initially encoded as discrete IDs, but they are turned into embeddings by something like `torch.nn.Embedding`. The embedding layer can be thought of as a "lookup table", but it is equivalent to multiplying one-hot token vectors by a weight matrix that is learned through gradient descent (adjusted at train time, fixed at inference time). The embedding length d is also fixed; L is not. If you check out the matrix multiplication formulas for both the embedding layer and attention, you will notice that they work for any number of rows/tokens L (this follows from the rules of matrix multiplication). The context limit is imposed by auxiliary factors: positional encoding and the model's overall ability to produce coherent output for very long inputs.
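To make the "lookup table vs. matrix multiplication" point concrete, here is a minimal pure-Python sketch. The sizes `VOCAB_SIZE` and `D`, the helper names, and the random weights are all made up for illustration; this is not how `torch.nn.Embedding` is implemented internally, just the same arithmetic written out by hand.

```python
import random

random.seed(0)
VOCAB_SIZE, D = 6, 4  # hypothetical toy sizes

# Embedding table: one row of length D per token ID.
# Random values stand in for trained weights here.
E = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(VOCAB_SIZE)]

def embed_lookup(token_ids):
    """Lookup-table view: pick row E[i] for each token ID."""
    return [E[i][:] for i in token_ids]

def embed_matmul(token_ids):
    """Matrix view: (L x VOCAB_SIZE one-hot matrix) @ (VOCAB_SIZE x D table)."""
    one_hot = [[1.0 if j == i else 0.0 for j in range(VOCAB_SIZE)]
               for i in token_ids]
    return [[sum(row[k] * E[k][c] for k in range(VOCAB_SIZE)) for c in range(D)]
            for row in one_hot]

short = [2, 5]            # L = 2
longer = [0, 1, 2, 3, 4]  # L = 5

for ids in (short, longer):
    a, b = embed_lookup(ids), embed_matmul(ids)
    # Both views agree, and the result always has L rows and D columns.
    assert all(abs(x - y) < 1e-12 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print(len(a), "x", len(a[0]))  # prints "2 x 4", then "5 x 4"
```

Nothing in either function depends on L, which is why the same weights handle a 2-token and a 5-token input without padding.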

When it comes to the meaning of the "bank" embedding, it cannot be interpreted directly; however, you can run statistical analysis on embeddings (like PCA). If we were to say that the embedding for "bank" contains all possible meanings of this word, then the particular one is inferred not by the embedding layer but by later attention operations that associate this token with the other tokens in the sequence (e.g. self-attention).


gushogg-blake 60 days ago

This is exactly what I was looking for, thanks!


sigmoid10 60 days ago

In this particular case the embedding wouldn't tell you anything about river bank vs. any other kind of bank. At that stage of the computation, this info simply isn't encoded yet. That would come from the context, which is calculated later in the attention matrix, i.e. the only place where tokens are cross-computed along the sequence dimension. "Bank" would have a strong connection to another token (or several) that defines its exact meaning in the current context, and together they would create a feature vector in an intermediate embedding space somewhere in the deep layers of the model. The embedding space talked about here is just the input/output matrix that compresses a huge, highly sparse input matrix (essentially just an array of one-hot vectors glued together) into something more compact and less sparse. There's no real theoretical need for this; it just so happens that GPUs suck at multiplying huge sparse matrices. If we ever get LLMs designed to run on CPUs or analog circuits, you might even be able to just get rid of it entirely.


GistNoesis 60 days ago

Typically the input of an LLM is a sequence of tokens, i.e. a list of integers between 0 and the vocabulary size (the number of distinct tokens).

The sequence is of variable length. This was one of the "early" problems in sequence modelling: how to deal with inputs of varying length in neural networks. There is a lot of literature about it.

This is the source of plenty of silent problems of various kinds:

- data out of distribution (short sequences vs long sequences may not have the same performance)

- quadratic behavior due to data copy

- normalization issues

- memory fragmentation

- bad alignment

One way of dealing with it is to treat a variable-length sequence as a fixed-size sequence, padding the empty positions with zeros and using "masks" to specify which elements should be ignored during the operations.
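A rough sketch of that padding-plus-mask idea, in plain Python. `PAD_ID`, `pad_batch` and `masked_mean` are hypothetical names chosen for this example; real frameworks have their own conventions for the padding ID and mask layout.

```python
PAD_ID = 0  # hypothetical: token ID reserved for padding

def pad_batch(sequences, length):
    """Right-pad each token-ID list to `length`; return (padded ids, masks)."""
    padded, masks = [], []
    for seq in sequences:
        if len(seq) > length:
            raise ValueError("sequence longer than the fixed length")
        n_pad = length - len(seq)
        padded.append(seq + [PAD_ID] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = ignore
    return padded, masks

def masked_mean(values, mask):
    """Average only the positions where mask == 1 (assumes at least one)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)

batch = [[5, 9, 3], [7]]
ids, masks = pad_batch(batch, length=4)
print(ids)    # [[5, 9, 3, 0], [7, 0, 0, 0]]
print(masks)  # [[1, 1, 1, 0], [1, 0, 0, 0]]

# Without the mask, the three padding zeros would drag this average to 0.5.
scores = [2.0, 0.0, 0.0, 0.0]
print(masked_mean(scores, masks[1]))  # 2.0
```

Forgetting the mask in a reduction like this is one way to get the silent normalization issues listed above: the code still runs, it just averages over padding.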

----

Concerning embeddings having multiple semantic meanings: it is best effort, and all combinations of behavior can occur. The embedding layer is typically the first layer, and it converts the integer token ID into a vector of floating-point numbers whose length is the embedding dimension. It tries its best to separate the meanings to make the task of the subsequent layers of the neural network easier. It's shovelling the shit it can't handle down the road for the next layers to deal with.

For experiments, you can try merging two tokens into one, or into the <unknown> token, in order to free up a token for special use without having to increase the size of the vocabulary.

Sometimes an embedding is the average of the disambiguated embeddings. Sometimes it is its own thing.

In addition to embeddings, you can often look at the inner representation at a specific depth of the neural network. There, after a few layers, the representations have usually been disambiguated based on the context.

The last layer is also especially interesting because it is the one used to project back to the original token space. Sometimes we force its weights to be shared with the embedding layer. This projection layer usually can't use context, so it must contain within itself all the information needed to map back to token space very simply. This last representation is often used as a full-sequence representation vector, which can be used for subsequent, more specialized training tasks.
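Here is a small sketch of what sharing the weights looks like: the same table `E` is used for the input lookup and, transposed, for the output projection. The table values and the `hidden` vector are invented for illustration, and bias terms and softmax are left out.

```python
# Toy embedding table (hypothetical values): 3 tokens, 2 dimensions.
E = [
    [1.0, 0.0],   # token 0
    [0.0, 1.0],   # token 1
    [-1.0, 0.0],  # token 2
]

def embed(token_id):
    """Input side: token ID -> vector (row lookup in E)."""
    return E[token_id]

def project_to_vocab(hidden):
    """Output side with tied weights: one logit per token, logits = E @ hidden."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in E]

# Pretend the layers in between produced this final hidden vector.
hidden = [0.2, 0.9]
logits = project_to_vocab(hidden)

# The highest logit belongs to the token whose embedding row best matches
# the hidden vector under a dot product.
predicted = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted)
```

Note that `project_to_vocab` only sees one position's hidden vector at a time, which is the "can't use context" property: any context has to already be baked into `hidden` by the earlier layers.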

Embedding weights are fixed after training, but in-context learning occurs during inference. The early tokens of the prompt help disambiguate the new tokens more easily. For example, <paragraph about money> bank vs <paragraph about landscape> bank vs bank on its own will all have the same input embedding for the bank token, but one or two layers down the line, the associated representations will be very different and close to the appropriate meaning.
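A toy version of the money/landscape example, as a sketch only. The 2-D embeddings in `EMB` are invented, and the query/key/value projections are replaced by the identity to keep it short; real models learn those matrices and stack many layers.

```python
import math

# Hypothetical 2-D toy embeddings: axis 0 ~ "finance", axis 1 ~ "landscape".
EMB = {
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "bank":  [1.0, 1.0],  # ambiguous: sits between the two senses
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_last(tokens):
    """Single-head self-attention output for the last token.

    Simplification: q = k = v = the input embedding (identity projections).
    """
    vecs = [EMB[t] for t in tokens]
    q = vecs[-1]
    scale = math.sqrt(len(q))
    # Scaled dot-product score of the last token against every token.
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vecs]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[c] for w, v in zip(weights, vecs)) for c in range(len(q))]

money_bank = attend_last(["money", "bank"])
river_bank = attend_last(["river", "bank"])

# The input embedding for "bank" is identical in both prompts...
assert EMB["bank"] == [1.0, 1.0]
# ...but the contextual vector leans toward the sense set by the context.
print(money_bank)  # larger on axis 0 ("finance")
print(river_bank)  # larger on axis 1 ("landscape")
```

No weights changed between the two calls; only the other token in the sequence did. That is the sense in which the embeddings are fixed while the representation still depends on the prompt.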


gushogg-blake 60 days ago

Exactly what I was looking for, thanks!
