by throw310822 1125 days ago
I have a very dumb question, I'll just throw it here: I understand word embeddings and tokenisation, and the value of each, but how can the two work together? Are embeddings calculated for tokens, and in that case, how useful are they, given that each token is just a fragment of a word, often with little or no semantic meaning?
2 comments

I've heard that nowadays subword/token embeddings are learned during the training phase, and that they are useful for reconstructing the embeddings of words that contain them, and in fact allow the model to handle typos like "aple" (instead of "apple").
The way transformers operate is by transforming the embedding space through each layer. You could say that all the "understanding" is happening in that high-dimensional space: the space of a single token embedding, multiplied by the number of tokens. Seeding the embedding space with some learned value for each token is helpful. Think of it as just a vector database: token -> vector.
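
To make the "token -> vector" picture concrete, here is a minimal sketch. Everything in it is made up for illustration: the vocabulary, the vector values, and the greedy longest-match splitter, which only stands in for a real subword tokenizer such as BPE. In a real model the vectors are learned during training.

```python
# Toy vocabulary: every entry is hypothetical, chosen so "apple" and "aple" both split.
VOCAB = ["app", "ap", "le", "a", "p", "l", "e"]

# The "vector database": token -> vector. Placeholder numbers, not learned values.
EMBEDDINGS = {tok: [float(i), float(len(tok))] for i, tok in enumerate(VOCAB)}


def tokenize(word):
    """Greedy longest-match split of a word into vocabulary pieces."""
    tokens = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in EMBEDDINGS:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no token covers {word[pos]!r}")
    return tokens


def embed(word):
    """Look up one vector per token; this is the model's input sequence."""
    return [EMBEDDINGS[tok] for tok in tokenize(word)]


print(tokenize("apple"))  # ['app', 'le']
print(tokenize("aple"))   # ['ap', 'le']
```

The typo and the correct spelling share the "le" piece, so part of their input is identical even before any layer runs. The rest of the meaning has to be recovered by the layers from the combination of pieces.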

Decoder-only architectures (such as GPT) mask the token embedding interaction matrix (attention) such that each token embedding and all subsequent transformations only have access to preceding token embeddings (and transforms). This means that on output, only the last transformed token embedding has the full information of the entire context - and only it is capable of making predictions for the next token.
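
A rough sketch of that masking, in plain Python with made-up uniform scores for a 4-token context. The function name is hypothetical, and real implementations do this on tensors, usually by setting the masked scores to negative infinity before the softmax. Note that in this sketch each position also attends to itself, not only to earlier ones.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], keeping only positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only look at positions 0..i (itself and earlier tokens).
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get a weight of exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # made-up raw scores, all equal
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

The first row puts all its weight on the first token, and only the last row spreads weight over all four, which is the sense in which only the last position sees the whole context.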

This is done so that during training, you can simultaneously make 1000s (context size) of predictions - every final token embedding transform is predicting the next token. The alternative (Encoder architecture, where there is no masking and the first token can interact with the final token) would result in massively inefficient training for predicting the next token as each full context can only make a single prediction.
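
As a small illustration of why the mask makes training efficient, this sketch lists the prediction tasks that one forward pass over a single sequence covers. The token ids and the helper name are made up, and a real training loop would compute all of these losses in one batched operation rather than building the pairs explicitly.

```python
def next_token_examples(token_ids):
    """Return (visible prefix, target) pairs covered by one causal forward pass."""
    return [
        (token_ids[: i + 1], token_ids[i + 1])
        for i in range(len(token_ids) - 1)
    ]


context = [11, 42, 7, 99, 3]  # made-up token ids
pairs = next_token_examples(context)
for prefix, target in pairs:
    print(prefix, "->", target)
print(len(pairs), "predictions from one sequence")
```

A sequence of 5 tokens gives 4 training predictions here. Without the mask, the same sequence would give only one usable prediction, since every position would already have seen the tokens it is supposed to predict.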

Disclaimer - someone from Marqo here.

Marqo supports E5 models: https://github.com/marqo-ai/marqo