Hacker News
by maciejzj 60 days ago
AFAIK, at the most basic level the input is a matrix with L rows (tokens) and d columns (embedding length). The input tokens start out as discrete IDs and are turned into embeddings by something like `torch.nn.Embedding`. The embedding layer can be thought of as a "lookup table", but it is equivalent to a matrix multiplication (one-hot rows times a weight matrix) whose weights are learned through gradient descent (adjusted at train time, fixed at inference time). The embedding length d is fixed; L is not. If you check the matrix multiplication formulas for both the embedding layer and attention, you will notice that they work for any number of rows/tokens L (this follows directly from the rules of matrix multiplication). The context limit is imposed by auxiliary factors: positional encoding and the model's overall ability to produce coherent output for very long input.
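
To make the "lookup table vs. matrix multiplication" point concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The vocabulary size, embedding length, random table `E`, and helper names are all made up for illustration; a real model would learn `E` during training and `torch.nn.Embedding` would do the row lookup on tensors.

```python
import random

VOCAB_SIZE = 6   # hypothetical tiny vocabulary
D = 4            # embedding length d (fixed by the model)

random.seed(0)
# Embedding table E: one row of length D per token ID.
# Random here; learned by gradient descent in a real model.
E = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(VOCAB_SIZE)]

def embed_lookup(token_ids):
    """Row lookup: returns the L x D matrix whose i-th row is E[token_ids[i]]."""
    return [list(E[t]) for t in token_ids]

def one_hot(token_ids):
    """L x VOCAB_SIZE matrix with a single 1.0 in each row."""
    return [[1.0 if v == t else 0.0 for v in range(VOCAB_SIZE)] for t in token_ids]

def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def embed_matmul(token_ids):
    """Same embedding, computed as one_hot(token_ids) @ E."""
    return matmul(one_hot(token_ids), E)

short_ids = [2, 5]                    # L = 2
long_ids = [0, 1, 2, 3, 4, 5, 2, 2]   # L = 8
for ids in (short_ids, long_ids):
    x = embed_matmul(ids)
    assert x == embed_lookup(ids)     # lookup and matmul agree
    print(len(x), "x", len(x[0]))     # prints "2 x 4", then "8 x 4"
```

Nothing in either function depends on L: the same table handles 2 tokens or 8, and only the number of output rows changes.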

As for the meaning of the "bank" embedding, it cannot be interpreted directly; however, you can run statistical analyses on embeddings (like PCA). If we say that the embedding for "bank" contains all possible meanings of the word, then the particular one is inferred not by the embedding layer but by the later attention operations that associate this token with the other tokens in the sequence (e.g. self-attention).

1 comments

This is exactly what I was looking for, thanks!
In this particular case the embedding wouldn't tell you anything about river bank vs. any other bank. At that stage of the computation, this info simply isn't encoded yet. It comes from the context, which is computed later in the attention matrix, i.e. the only place where tokens are cross-computed along the sequence dimension. "Bank" would have a strong connection to another token (or several) that defines its exact meaning in the current context, and together they would create a feature vector in an intermediate embedding space somewhere in the deep layers of the model. The embedding space talked about here is just the input/output matrix that compresses a huge, highly sparse input matrix (essentially just an array of one-hot vectors glued together) into something more compact and less sparse. There's no real theoretical need for this; it just so happens that GPUs suck at multiplying huge sparse matrices. If we ever get LLMs designed to run on CPUs or analog circuits, you might even be able to get rid of it entirely.
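
A toy sketch of the context point, in plain Python: the input embedding for "bank" is identical in both sequences, and only the attention step makes its vector differ. The 3-dimensional embeddings below are hand-picked for illustration, and the attention uses identity Q/K/V projections, which is a simplification (a real model has learned projections, multiple heads, and many layers).

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMB = {
    "river": [1.0, 0.0, 0.2],
    "money": [0.0, 1.0, 0.2],
    "bank":  [0.5, 0.5, 1.0],
}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention on an L x d matrix.

    Assumption: Q, K and V are all just x (identity projections).
    """
    d = len(x[0])
    out = []
    for q in x:
        # Compare this token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        # The output row is a weighted mix of all token vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, x)) for j in range(d)])
    return out

def contextual(tokens, word):
    """Vector for `word` after one attention step over `tokens`."""
    x = [EMB[t] for t in tokens]
    return self_attention(x)[tokens.index(word)]

bank_river = contextual(["river", "bank"], "bank")
bank_money = contextual(["money", "bank"], "bank")

print(bank_river)
print(bank_money)
assert bank_river != bank_money   # same input embedding, different context
```

In this toy setup the "river" context pulls the first component of the "bank" vector up and the "money" context pulls the second one up, which is the cross-token mixing described above in miniature.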