An object in the real world can be located in 3d space. You can say that one representation of that object is as a point in that space; it is embedded in a 3d embedding space.
Of course, those coordinates are not the only way in which the object can be represented, but for a certain problem context, these location coordinates are useful.
Given objects A, B, C, or rather their coordinates, you can tell which two are closest to each other, or find the point D that completes the parallelogram with them. In fact, it lets you do similarity tests like "A:B :: C:D". This is just standard vector algebra.
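To make the vector algebra concrete, here is a minimal sketch in plain Python. The coordinates and the helper names (`closest_pair`, `complete_parallelogram`) are made up for illustration; the only real idea is that D = C + (B - A).

```python
import math
from itertools import combinations

# Hypothetical coordinates for three objects in 3D space.
points = {
    "A": (0.0, 0.0, 0.0),
    "B": (1.0, 0.0, 0.0),
    "C": (0.0, 3.0, 0.0),
}

def closest_pair(pts):
    """Return the names of the two points with the smallest Euclidean distance."""
    return min(
        combinations(pts, 2),
        key=lambda pair: math.dist(pts[pair[0]], pts[pair[1]]),
    )

def complete_parallelogram(a, b, c):
    """Return D such that A:B :: C:D, i.e. D = C + (B - A)."""
    return tuple(ci + (bi - ai) for ai, bi, ci in zip(a, b, c))

pair = closest_pair(points)
d = complete_parallelogram(points["A"], points["B"], points["C"])
print(pair)  # ('A', 'B')
print(d)     # (1.0, 3.0, 0.0)
```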
Now, imagine each word associated with a 100-dimensional vector. You can do the same thing. Amazingly, one can do things like "man:woman :: king: ..." and get the answer "queen", just by treating each word as a vector, doing the arithmetic, and looking up the word whose vector is nearest to the result. It almost feels ... intelligent!
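Here is a rough sketch of that lookup, using hand-picked 2D vectors instead of real 100-dimensional learned ones. The vocabulary, the numbers, and the `analogy` function are all assumptions for illustration; a real model would learn the vectors, and the match would only be approximate.

```python
import math

# Toy 2D "embeddings", hand-picked for illustration (not learned):
# axis 0 is roughly "royalty", axis 1 is roughly "gender".
vectors = {
    "man":   (0.0, 0.0),
    "woman": (0.0, 1.0),
    "king":  (1.0, 0.0),
    "queen": (1.0, 1.0),
    "apple": (-1.0, 0.5),
}

def analogy(a, b, c, vocab):
    """Solve a:b :: c:? by computing c + (b - a) and returning the nearest word."""
    target = tuple(
        ci + bi - ai for ai, bi, ci in zip(vocab[a], vocab[b], vocab[c])
    )
    # Exclude the query words themselves, as analogy lookups usually do.
    candidates = (w for w in vocab if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(vocab[w], target))

answer = analogy("man", "woman", "king", vocab=vectors)
print(answer)  # queen
```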
This embedding -- each word associated with an n-D vector -- is obtained while training neural nets. In fact, now you have readymade, pre-trained embedding approaches like Word2Vec.
An Embedding is an n-dimensional vector (think of it as a sequence of n numbers).
During training, each token (or word) gets an Embedding assigned.
Critically, _similar words will get similar embeddings_. And "similar words" could mean similar either semantically or (as was the example) syntactically ("apple" and "appli").
And being vectors, you can do operations on them. To give the classic example, you could do:
Embedding(`king`) + Embedding(`female`) ≈ Embedding(`queen`).
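One common way to put a number on "similar embeddings" is cosine similarity, which compares the direction of two vectors. A minimal sketch follows; the three vectors are invented for the example, and real embeddings would have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3D vectors; a trained model would supply these.
cat = (0.9, 0.8, 0.1)
dog = (0.8, 0.9, 0.2)
car = (0.1, 0.2, 0.9)

# "cat" should be closer to "dog" than to "car".
print(cosine(cat, dog) > cosine(cat, car))  # True
```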
Imagine you think of 2 numbers to describe a basketball. You give a number for weight (1), and redness (0.7). Now, a basketball can be described by those 2 numbers, (1, 0.7). That is an embedding of a basketball in 2d space. In that coordinate system a baseball would be less heavy and less red, so maybe you would embed it as (0.2, 0.2).
basketball ==> (1.0, 0.7) # heavier, redder
baseball ==> (0.2, 0.2) # less heavy, less red
When an LLM (large language model) is fed a word, it transforms that word into a vector in n-dimensional space. For example:
basketball -> [0.5, 0.3, 0.6, ... , 0.9] # Here the embedding is many, many numbers
It does this because computers process numbers, not words. These numbers all represent some property of the word/concept basketball in a way that makes sense to the model. It learns to do this during its training, and the humans that train these models can only guess what the embedding mappings it's learning actually represent. This is the first step of what an LLM does when it processes text.
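Mechanically, that first step is usually just a table lookup: each token has an ID, and the ID selects a row of numbers. The sketch below uses random numbers as stand-ins for learned values; the vocabulary, `dim`, and the `embed` helper are all hypothetical.

```python
import random

random.seed(0)  # reproducible toy values

vocab = ["basketball", "baseball", "hoop"]
dim = 4  # toy size; real models use hundreds or thousands of dimensions

# One row of numbers per token. In a real model these are learned in training;
# here they are random placeholders.
token_to_id = {tok: i for i, tok in enumerate(vocab)}
embedding_table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def embed(token):
    """Map a token to its vector by looking up its row in the table."""
    return embedding_table[token_to_id[token]]

vec = embed("basketball")
print(len(vec))  # 4
```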
I have no idea if these concepts are similar, but as a machine learning beginner, I found the concept of a "perceptron" [1] to be useful in understanding how networks get trained. IIRC a perceptron is either activated or not activated by a particular input, depending on the weights that the network under training has learned for the connections between the two. What it means to be activated or not depends on that perceptron's overall function. That perceptron is like a single "cell" of the larger matrix, maybe like the cells in your brain.
When I read the GP description referring to "embedding" above I thought of the perceptron.
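For anyone who wants to see the perceptron idea in code, here is a small sketch of the classic perceptron learning rule on a logical AND gate. The function names, learning rate, and epoch count are my own choices for the example, not anything from the thread.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of inputs plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=10, lr=1.0):
    """Perceptron learning rule: nudge weights in proportion to the error."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            error = label - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# Logical AND as a tiny, linearly separable training set.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_gate)
outputs = [predict(weights, bias, x) for x, _ in and_gate]
print(outputs)  # [0, 0, 0, 1]
```

The "activated or not" part is the threshold in `predict`; training only adjusts the weights and bias that feed into it.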
Definitely not supernatural at all. The act of making an automaton that "can perceive" feels to me like it's closer to the opposite. Taking that which might seem mystical and breaking it down into something predictable and reproducible.
https://www.tensorflow.org/tutorials/text/word2vec