by n2d4
973 days ago
In essence, an embedding vector is a (lossy) compression of its input. Any compression scheme could in theory be used to make such vectors; for example, people have tried using gzip embeddings. Now, on how to get a compression vector from an LLM, simplified: most ML models are built from layers executed one after another. Some layers are bigger, some are smaller, but each has a defined input and output. If a layer's input size is smaller than the model's input size, then (lossy) compression must have happened to get there. So you just evaluate the LLM on whatever you want to embed and take the activation at the smallest layer input, and that's your embedding vector. Not every compression vector makes for good semantic embeddings (which require that two similar phrases end up next to each other in the embedding space), but because of how ML models work, this tends to be the case empirically.
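To make the "take the activation at the smallest layer input" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the layer sizes, the random (untrained) weights, the tanh activation, and the helper names `make_layer`, `apply_layer`, and `embed` are all hypothetical, and a real LLM is far more involved. With untrained weights the resulting vector carries no semantic meaning; the code only shows where in the forward pass the vector comes from.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    # Random weights stand in for trained ones (assumption for this toy);
    # a real model would load learned parameters instead.
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x):
    # One dense layer: weighted sum per output unit, then tanh.
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def embed(layers, x):
    """Run x through the layers and return the smallest layer input seen."""
    smallest = x
    for weights in layers:
        # At this point x is the input to the current layer.
        if len(x) < len(smallest):
            smallest = x
        x = apply_layer(weights, x)
    return smallest


rng = random.Random(0)
sizes = [16, 8, 3, 8, 16]  # hypothetical widths; the bottleneck has 3 units
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

vec = embed(layers, [0.5] * 16)
print(len(vec))  # length of the embedding vector
```

The 16-dimensional input is squeezed down to 3 numbers at the narrowest point, and those 3 numbers are what you would keep as the embedding.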
Can this be used to compress non-text sequences such as byte strings?