| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by whoateallthepy 1155 days ago

This is a great set of comments/questions! To try and answer this a bit briefly:

The input string is tokenized into a sequence of token indices (integers) as the first step of processing the input. For example, "Hello World" is tokenized to:

  [15496, 2159]

The first step in a transformer network is to embed the tokens. Each token index is mapped to a (learned or fixed) embedding (a vector of floats) via the embeddings table. The Embeddings module from PyTorch is commonly used. After mapping, the matrix of embeddings will look something like:

  [[-0.147, 2.861, ..., -0.447],
   [-0.517, -0.698, ..., -0.558]]

where the number of columns is the model dimension.

A single transformer block takes a matrix of embeddings and transforms them to a matrix of identical dimensions. An important property of the block is that if you reorder the rows of the matrix (which can be done by reordering the input tokens), the output will be reordered but otherwise identical too. (The formal name for this is permutation equivariance).

In problems related to language it seems inappropriate to have the order of tokens not matter, so to solve for this we need to adjust the embeddings of the tokens initially based on their position.

There are a few common ways you might see this done, but they broadly work by assigning fixed or learned embeddings to each position in the input token sequence. These embeddings can be added to our matrix above so that the first row gets the embedding for the first position added to it, the second row gets the embedding for the second position, and so on. Now if the tokens are reordered, the combined embedding matrix will not be the same. Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).

I put together this repository at the end of last year to better help visualize the internals of a transformer block when applied to a toy problem: https://github.com/rstebbing/workshop/tree/main/experiments/.... It is not super long, and the point is to try and better distinguish between the quantities you referred to by seeing them (which is possible when embeddings are in a low dimension).

I hope this helps!

2 comments

Buttons840 1155 days ago

> Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).

Yes, the entire description is helpful, but I especially appreciate this validation that concatenating the position encoding is a valid option.

I've been thinking a lot about aggregation functions, usually summation since it's the most basic aggregation function. After adding the token embedding and the positional encoding together, it seems information has been lost, because the resulting sum cannot be separated back into the original values. And yet, that seems to be what they do in most transformers, so it must be worth the trade-off.

It reminds me of being a kid, when you first realize that zipping a file produces a smaller file and you think "well, what if I zip the zip file?" At first you wonder if you can eventually compress everything down to a single byte. I wonder the same with aggregation / summation, "if I can add the position to the embedding, and things still work, can I just keep adding things together until I have a single number?" Obviously there are some limits, but I'm not sure where those are. Maybe nobody knows? I'm hoping to study linear algebra more and perhaps I will find some answers there?

link

whoateallthepy 1155 days ago

One thing to bear in mind is that these embedding vectors are high dimensional, so that it is entirely possible that the token embedding and position embedding are near-orthogonal to one another. As a result, information isn't necessarily lost.

link

zwaps 1155 days ago

The information might be formally lost for the given token, but remember that transformers train on huge amounts of data.

The (absolute) positional encoding is an arbitrary but fixed bias (push into some direction). The word "cat" at position 2 is pushed into the 2-direction. This "cat" might be different from a "cat at position 3, such that the model can learn about this distinction.

Nevertheless, the model could also still learn to keep "cats" at all positions together, for instance such "cats" are more similar to "cats" than to "dogs" at any position. More importantly, for some words, the model might learn that a word at the beginning of the sequence should have an entirely different meaning than the same word at the end of the sequence.

In other words, since the embeddings are a free parameter to be learned (usually both as embeddings, and weight-tied in the head), there isn't any loss in flexbility. Rather, the model can learn how much mixing is required or whether the information added by the positional embedding should be seperable (for instance by making embeddings linearly independent otherwise)

If you concat, you carry along an otherwise useless and static dimension, and mixing it into the embeddings would be the very first thing the model learns in layer 1.

link

dist-epoch 1155 days ago

> The input string is tokenized into a sequence of token indices (integers)

How is this tokenization done? Sometimes a single word can be two tokens. My understanding is that the token indices are also learned, but by whom? The same transformer? Another neural network?

link

montebicyclelo 1155 days ago

Huggingface have good guides on tokenization, and tokenizer training. BPE (e.g. used by gpt) and wordpiece (e.g. used by bert) are two commonly used methods https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt

link

whoateallthepy 1155 days ago

The tokenization is done by the tokenizer which can be thought of as just a function that maps strings to integers before the neural network. Tokenizers can be hand-specified or learned, but in either case this is typically done separately from training the model. It is also less frequently necessary unless you are dealing with an entirely new input type/language.

Tokenizers can be quite gnarly internally. https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt is a good resource on BPE tokenization.

link