by n2d4 973 days ago
In essence, an embedding vector is a form of (lossy) compression. In theory, any compression scheme could be used to produce such vectors; for example, people have tried building embeddings from gzip.
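To make the gzip idea concrete, here is a rough sketch, not anyone's published method. It uses Python's zlib (the same DEFLATE algorithm as gzip, with a different header) to compute a normalized compression distance between two texts. The names `ncd` and `gzip_embedding`, and the idea of using a fixed list of anchor texts as coordinates, are my own assumptions for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means more shared structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def gzip_embedding(text: str, anchors: list[str]) -> list[float]:
    """Hypothetical helper: one coordinate per anchor text, each an NCD value."""
    return [ncd(text, anchor) for anchor in anchors]
```

Texts that share a lot of substrings compress well together, so their distance is small. It only captures surface overlap, not meaning, which is why it makes for a crude baseline.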

Now, how to get a compression vector from an LLM, simplified: most ML models are built from layers that are executed one after another. Some layers are bigger, some are smaller, but each has a defined input and output size. If a layer's input is smaller than the model's input, (lossy) compression must have happened to get there. So you just evaluate the LLM on whatever you want to embed and take the activation at the smallest layer input; that's your embedding vector.
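A toy version of that recipe, as a sketch only: the network below is a made-up stack of dense layers with random, untrained weights (sizes 8, 4, 2, 4, 8 are an assumption), not a real LLM. It just shows the mechanics of recording every layer's input during a forward pass and returning the smallest one.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """One dense layer as n_out rows of n_in random (untrained) weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x):
    """Dense layer followed by a tanh nonlinearity."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def embed(layers, x):
    """Run x through all layers and return the smallest layer input seen."""
    layer_inputs = []
    for weights in layers:
        layer_inputs.append(x)  # x is what this layer receives
        x = apply_layer(weights, x)
    return min(layer_inputs, key=len)


rng = random.Random(0)
sizes = [8, 4, 2, 4, 8]  # assumed toy architecture with a 2-wide bottleneck
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
vector = embed(layers, [0.5] * 8)  # a 2-number "embedding" of the 8-number input
```

With a real model you would do the same thing through whatever mechanism your framework offers for reading intermediate activations.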

Not every compression vector makes for a good semantic embedding (which requires that two similar phrases end up close to each other in the embedding space), but because of how ML models work, this tends to be the case empirically.
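"Close to each other" is usually measured with something like cosine similarity. A minimal sketch follows; the three vectors are made up by hand for illustration and do not come from any real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-D vectors, for illustration only (not output of any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
```

For a good semantic embedding you would expect `related` to come out well above `unrelated`, as it does for these hand-picked numbers.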

How do you choose the size of the embedding vector?

Can this be used to compress non-text sequences such as byte strings?

1. Usually it's a multi-way tradeoff between how much data you want to use, how much compute you want to spend, how much time you have available, how much training data you have available and how accurate you want the embeddings to be.

2. Yes, but lossily. Some kinds of byte strings can tolerate a couple of accidentally changed bits; others cannot tolerate that at all without being hopelessly corrupted. This technique is not a magic card to surpass the limits imposed by information theory; it's "just" a more sophisticated dictionary for your compression algorithm.

Regarding the second question: yes, as long as you can train a machine learning model to learn the semantics. The keyword to search for, if you want to look into this, is "autoencoders".
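For a feel of what an autoencoder does, here is a deliberately tiny sketch in plain Python: a linear 2 -> 1 -> 2 network trained with SGD to reconstruct its own input. The data (points on the line y = 2x), the learning rate, and all names are assumptions chosen for illustration; real autoencoders are nonlinear and much larger.

```python
import random


def train_autoencoder(data, epochs=2000, lr=0.01, seed=0):
    """Fit a linear autoencoder 2 -> 1 -> 2 with plain SGD on squared error."""
    rng = random.Random(seed)
    enc = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # encoder weights (2 -> 1)
    dec = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # decoder weights (1 -> 2)
    for _ in range(epochs):
        for x in data:
            z = enc[0] * x[0] + enc[1] * x[1]  # the 1-number embedding
            err = [dec[i] * z - x[i] for i in range(2)]  # reconstruction error
            grad_z = 2 * (err[0] * dec[0] + err[1] * dec[1])
            for i in range(2):
                dec[i] -= lr * 2 * err[i] * z
                enc[i] -= lr * grad_z * x[i]
    return enc, dec


def reconstruction_loss(enc, dec, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        z = enc[0] * x[0] + enc[1] * x[1]
        total += sum((dec[i] * z - x[i]) ** 2 for i in range(2))
    return total / len(data)


# Toy data: 2-D points that all lie on the line y = 2x, so one number suffices.
data = [(t / 10, 2 * t / 10) for t in range(-10, 11)]
untrained_loss = reconstruction_loss(*train_autoencoder(data, epochs=0), data)
enc, dec = train_autoencoder(data)
trained_loss = reconstruction_loss(enc, dec, data)
```

The value `z` in the middle is the compressed representation. It works here only because the data really has one degree of freedom; data without that kind of structure will not squeeze through the bottleneck without loss.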