| > My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation. Yeah, you've got it. A mapping from words to vectors such that semantic similarity between words is reflected in mathematical similarity between vectors. An idea of how you might train this thing: lets say the words "king" and "queen" are being embedded. In your training data there are lots of examples where "king" and "queen" are interchangeable, for example in the sentence "The ___ is dead, long live the ____", either word is appropriate in either slot, so each time we see an example like this we nudge "king" and "queen" a little closer together in some sense. However you also find phrases where they are not interchangeable, such as "The first born male will one day be ____". So when you see those examples you nudge "king" a little closer in some sense to other words which appropriately complete the sentence (which does not include "queen" in this case). In this way, repeated over a giant training set with thousands of words, concepts like "male/female" and "royalty", "person/object" and tons of others end up getting reflected in the relationships between the vectors. These vectors are then useful representations of words to ML models. |
Starting with: what do you store in it?
Maybe sentence/vector pairs. But what does that give you? What do you do with that data algorithmically? What's the equivalent of a SELECT statement? What's the application that benefits an end user? That part still seems rather hazy.