| HN Mirror



	by 26fingies 971 days ago
	Admittedly I more or less skimmed and plan on going back over this tomorrow, but I don't see how these vectors are actually created. I get that I could use your LLM tool or whatever, but that seems unsatisfactory. How is the sausage made? (Or if that's explained, can someone point me at the right place to look?)

4 comments

n2d4 971 days ago

In essence, an embedding vector is (lossy) compression. Any compression scheme could in theory be used to make such vectors; for example, people have tried using gzip-based embeddings.
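A rough sketch of the gzip idea, assuming the normalized compression distance (NCD) as the similarity measure. The helper `gzip_embedding` and the notion of fixed "anchor" texts are made up for illustration and are not taken from any particular paper or library:

```python
import gzip

def compressed_size(text: str) -> int:
    """Number of bytes gzip needs for the UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means more shared structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def gzip_embedding(text, anchors):
    """Hypothetical helper: describe text by its distance to each anchor text."""
    return [ncd(text, anchor) for anchor in anchors]
```

Texts that share many substrings compress well together, so their distance comes out lower. Very short strings are dominated by gzip's fixed overhead, so this only behaves sensibly on longer inputs.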

Now on how to get an embedding vector from an LLM, simplified: Most ML models are built from layers executed one after another. Some layers are bigger, some are smaller, but each has a defined input and output size. If a layer's input is smaller than the model's input, (lossy) compression must have happened to get there. So you just evaluate the LLM on whatever you want to embed and take the activation at the smallest layer input, and that's your embedding vector.
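To make the mechanics concrete, here is a toy sketch with a made-up stack of dense layers. The layer widths and the random, untrained weights are assumptions chosen purely for illustration; a real model would be trained, and which layer gets used (and how it is pooled) varies by model:

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values); untrained, illustration only."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def apply_layer(weights, x):
    """Dense layer followed by a tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def embed(layers, x):
    """Run x through all layers and return the narrowest activation seen."""
    narrowest = x
    for weights in layers:
        x = apply_layer(weights, x)
        if len(x) < len(narrowest):
            narrowest = x
    return narrowest

rng = random.Random(0)
sizes = [16, 8, 4, 8, 16]  # hypothetical widths with a 4-wide bottleneck
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
vector = embed(layers, [rng.random() for _ in range(16)])  # 4 floats
```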

Not every compression vector makes for a good semantic embedding (which requires that two similar phrases end up next to each other in the embedding space), but because of how ML models work, this tends to be the case empirically.


optimalsolver 971 days ago

How do you choose the size of the embedding vector?

Can this be used to compress non-text sequences such as byte strings?


WJW 971 days ago

1. Usually it's a multi-way tradeoff between how much data you want to use, how much compute you want to spend, how much time you have available, how much training data you have available and how accurate you want the embeddings to be.

2. Yes, but lossily. Some types of byte strings are such that it doesn't matter if you accidentally change a couple of bits; others cannot tolerate that at all without being hopelessly corrupted. This technique is not a magic card to surpass the limits imposed by information theory; it's "just" a more sophisticated dictionary for your compression algorithm.


n2d4 971 days ago

Regarding the second question, yes, as long as you can train a machine learning model to learn the semantics. The keyword to search for if you want to look into this is "autoencoders".
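For a feel of what an autoencoder does with byte strings, here is a minimal sketch: a linear encoder/decoder pair trained with plain SGD to reconstruct 8-byte inputs through a 2-number code. All names, sizes, and hyperparameters are illustrative assumptions; real autoencoders are nonlinear and much larger:

```python
import random

def init_weights(n_in, code_size, seed=0):
    """Small random encoder (code_size x n_in) and decoder (n_in x code_size)."""
    rng = random.Random(seed)
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(code_size)]
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(code_size)] for _ in range(n_in)]
    return w_enc, w_dec

def encode(w_enc, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]

def decode(w_dec, h):
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_dec]

def mean_error(w_enc, w_dec, data):
    """Average squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        y = decode(w_dec, encode(w_enc, x))
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / len(data)

def train(w_enc, w_dec, data, epochs=300, lr=0.05):
    """In-place SGD on the squared reconstruction error."""
    n, k = len(w_dec), len(w_enc)
    for _ in range(epochs):
        for x in data:
            h = encode(w_enc, x)
            err = [yi - xi for yi, xi in zip(decode(w_dec, h), x)]
            # Gradient w.r.t. the code, computed before the decoder changes.
            grad_h = [sum(err[i] * w_dec[i][j] for i in range(n)) for j in range(k)]
            for i in range(n):
                for j in range(k):
                    w_dec[i][j] -= lr * err[i] * h[j]
            for j in range(k):
                for i in range(n):
                    w_enc[j][i] -= lr * grad_h[j] * x[i]

raw = [b"abcdefgh", b"abcdefgi", b"12345678", b"12345679"]
data = [[byte / 255 for byte in s] for s in raw]  # scale bytes to [0, 1]
w_enc, w_dec = init_weights(n_in=8, code_size=2)
before = mean_error(w_enc, w_dec, data)
train(w_enc, w_dec, data)
after = mean_error(w_enc, w_dec, data)
embedding = encode(w_enc, data[0])  # the 2-number code for b"abcdefgh"
```

The output of `encode` is the embedding. The decoder only exists so that training has something to score against.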


Olreich 971 days ago

The “secret sauce” is Word2Vec. Take a snapshot of all the text you can find on the internet, then with a sliding context window, vectorize each word based on what words are around it. The core assumption is that words with similar context have similar meaning. Nobody decides by hand which vector components represent what; the vectors are learned by training a small neural network to predict a word from its context (or the context from the word). Here’s a paper about the technique which might help: https://arxiv.org/pdf/1301.3781.pdf
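The sliding-window step can be sketched like this. The function name and the tiny example sentence are made up for illustration; a real pipeline would also tokenize properly and typically subsample frequent words:

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = context_pairs(tokens, window=1)
```

These pairs are the training data: the model is asked to predict the context word given the center word, or the other way around.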

Once all the words have vectors you can assume that there’s meaning in there and move on to trying to math these vectors against each other to find interesting correlations. It looks like the scoring for the initial training is based on making the vectors computable in various ways, so you can likely come up with a comparison criterion different from the one the paper uses and get a more useful vectorization for your own purposes. Seems like cosine similarity is good enough for most things though.
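For reference, cosine similarity is just the dot product divided by the product of the two vector lengths. A minimal version, assuming plain lists of floats of equal length and no zero vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Because it only looks at direction, two vectors pointing the same way score close to 1 regardless of their lengths.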


PeterisP 971 days ago

The general idea is that you have a particular task & dataset, and you optimize these vectors to maximize performance on that task. So the properties of these vectors - what information is retained and what is left out during the 'compression' - are effectively determined by that task.

In general, the core task for the various "LLM tools" involves prediction of a hidden word, trained on very large quantities of real text - thus also mirroring whatever structure (linguistic, syntactic, semantic, factual, social bias, etc.) exists there.
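As an illustration of what "prediction of a hidden word" looks like as training data, here is a small sketch. The function name and the `[MASK]` placeholder string are assumptions for the example; real setups hide only a random subset of tokens rather than every position in turn:

```python
def masked_examples(tokens, mask="[MASK]"):
    """One example per position: the sentence with that word hidden, plus the answer."""
    examples = []
    for i, word in enumerate(tokens):
        hidden = tokens[:i] + [mask] + tokens[i + 1:]
        examples.append((hidden, word))
    return examples

examples = masked_examples("the cat sat".split())
```

The model sees the masked sentence and is scored on how well it guesses the hidden word, so the text itself supplies the labels.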

If you want to see how the sausage is made and look at the actual algorithms, the two key approaches to read up on would probably be Mikolov's word2vec (https://arxiv.org/abs/1301.3781), with its CBOW (Continuous Bag of Words) and continuous skip-gram models, which are based on relatively simple mathematical optimization, and then BERT (https://arxiv.org/abs/1810.04805), which does a conceptually similar thing but with a large neural network that can learn more from the same data. For both, you can either read the original papers or look up blog posts or videos that explain them; different people have different preferences on how readable academic papers are.
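To show how simple the optimization can be, here is a toy skip-gram trainer in pure Python. It is a sketch under loud assumptions: it uses a plain full softmax over a tiny vocabulary for readability, which is not how practical implementations scale, and the function name, corpus, and hyperparameters are all made up:

```python
import math
import random

def train_skipgram(sentences, dim=8, window=2, epochs=50, lr=0.1, seed=0):
    """Learn one vector per word by predicting context words from center words."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]
    losses = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for sent in sentences:
            for i, center in enumerate(sent):
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j == i:
                        continue
                    v = w_in[index[center]]
                    target = index[sent[j]]
                    # Softmax over the whole vocabulary (shifted for stability).
                    scores = [sum(a * b for a, b in zip(row, v)) for row in w_out]
                    top = max(scores)
                    exps = [math.exp(s - top) for s in scores]
                    z = sum(exps)
                    probs = [e / z for e in exps]
                    total -= math.log(probs[target])
                    count += 1
                    # Cross-entropy gradient w.r.t. the scores: probs - one_hot.
                    grad = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
                    grad_v = [sum(grad[k] * w_out[k][d] for k in range(len(vocab)))
                              for d in range(dim)]
                    for k in range(len(vocab)):
                        for d in range(dim):
                            w_out[k][d] -= lr * grad[k] * v[d]
                    for d in range(dim):
                        v[d] -= lr * grad_v[d]
        losses.append(total / count)  # mean loss for this epoch
    return dict(zip(vocab, w_in)), losses

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
vectors, losses = train_skipgram(corpus)
```

After training, `vectors` maps each word to its array of floats. On a corpus this small the vectors are not meaningful; the point is only the shape of the training loop.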


lelanthran 971 days ago

I second this question. Not that the article wasn't helpful, but I think of it as a good start - now how do I generate that array of floats?
